Abstract. To better understand the interplay of censoring and sparsity, we develop finite sample properties of the nonparametric Cox proportional hazards model. Motivated by the high impact of sequencing data, which carry genetic information on each individual, we work with an over-parametrized problem and propose a general class of group penalties suitable for sparse structured variable selection and estimation. Novel non-asymptotic sandwich bounds for the partial likelihood are developed. We establish how they extend Le Cam's notion of local asymptotic normality (LAN). Such non-asymptotic LAN principles are further extended to high dimensional spaces where $p \gg n$. Finite sample prediction properties of the penalized estimator in the non-parametric Cox proportional hazards model, under suitable censoring conditions, agree with those of the penalized estimator in linear models.

1. Introduction 

Sparse linear models have emerged as a powerful framework for dealing with high dimensional problems where the size $p$ of the parameter space is much larger than the sample size $n$, i.e. for the case of $p \gg n$. Their applications range from machine learning to signal processing, and they provide a necessary tool for analyzing new, high-throughput data emerging from various scientific fields, such as online auction data, functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) signals, or new techniques of mRNA sequencing. The approach of $\ell_1$ regularization has become a popular tool to efficiently address such problems by assuming sparsity in the parameter space (Tibshirani, 1996; Fan and Li, 2001; Meinshausen and Yu, 2009; Lv and Fan, 2009; Zhou, 2010). If the number of sparse elements is not our only constraint, different forms of group or hierarchical regularization have been proposed to handle structured information (see for example Yuan and Lin (2006); Zhao et al. (2009)). In the general sparsity setting, finite sample predictive properties of penalized least squares procedures have been well studied and understood. Finite sample oracle inequalities with and without group structure have been obtained even for overparametrized problems where $p \gg n$ (see for example Bickel et al. (2009); Meier et al. (2009); Ravikumar et al. (2009); Kolar et al. (2011); Jenatton et al. (2011); Lounici et al. (2011); Raskutti et al. (2012)). While a large body of work has focused on parametric and non-parametric linear models, censored high dimensional data have been left fairly unexplored when $p \gg n$. Moreover, driven by the known effect that the censoring rate has on sample reduction, we were interested in discovering finite sample properties of censored models and their possible dependence on the censoring rate.

Methods for estimating high-dimensional proportional hazards models can be classified into two categories, univariate and regularized multivariate, and their analysis into asymptotic and non-asymptotic kinds. Univariate or marginal methods for predicting genetic abundances that are linked to survival have been used for decades without much understanding of the underlying possibility of severe overestimation. Zhao and Li (2012) outlined basic assumptions needed for univariate hazards models to have model selection consistency. As expected, the conditions are quite restrictive and significantly impair the applicability of such methodology. Our focus here is the non-asymptotic analysis of multivariate regularized methods for right censored data, where we aim for an estimator that can accurately predict nonparametric covariate effects. A popular approach is to look at the $\ell_1$-norm regularized log likelihood (see Tibshirani (1996); Fan and Li (2001); Yuan and Lin (2006)). Multivariate methods have been analyzed recently in Bradic et al. (2011), where the emphasis was on asymptotic model selection properties of the non-convex regularized partial likelihood. Non-asymptotic oracle properties have been developed for different additive hazards models (Gaïffas and Guilloux, 2012; Kong and Nan, 2012; Lemler, 2012), where a weighted $\ell_1$-norm penalty was employed to encourage sparsity and the likelihood function was approximated with an $\ell_2$ type of criterion. To the best of our knowledge, non-asymptotic, finite sample prediction properties of multivariate methods for Cox-like partial likelihoods have not been studied when the dimensionality of the covariates is ultra-high. They pose significant challenges, as the partial likelihood does not obey an $\ell_2$ criterion and strict convexity cannot be guaranteed in the whole high dimensional space.

To that end, we consider bivariate data $\{(X_i, T_i) : i = 1, \dots, n\}$, which form an i.i.d. sample from the population $(X, T)$ and where $X_i = (X_{i1}, \dots, X_{ip})^T$ is a column vector of $p$ covariate processes for the $i$-th individual. We are interested in the cases where not all survival times $(T_i)_{i=1}^n$ are fully observable and where an independent right censoring scheme is assumed (censoring times $(C_i)_{i=1}^n$ are, conditionally on $(X_i)_{i=1}^n$, independent of survival times $(T_i)_{i=1}^n$). Hence we work with an i.i.d. sample of the form

$\{(X_i, Z_i, \delta_i) : i = 1, \dots, n\}$,

where $Z_i = \min(T_i, C_i)$ and $\delta_i = 1\{T_i \le C_i\}$ are the event times and censoring indicators, respectively. Note that we do not allow for time-dependent covariates, and will work under boundedness assumptions of the form $\mathcal{X} = [a,b]^p$ (for some constants $a, b$). The conditional hazard function of $T$ given $X = x$ is denoted by $\lambda(t|x)$ and is defined as the instantaneous rate of failure at time $t$ given a particular value $x$ of the covariate $X$. We are interested in the non-parametric hazards model, where the effect of the covariate on the intensity process can be described as:

(1) $\lambda(t|x) = \lambda_0(t) g(x)$,

for a baseline hazard function $\lambda_0(t)$, the conditional hazard function of $T$ given $x = 0$ (the common covariate effect on the survival time), and a relative risk function $g : \mathbb{R}^p \to \mathbb{R}^+$.
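
To make the sampling scheme concrete, the following is a minimal simulation sketch of right-censored data from model (1). It is an illustration only and rests on assumptions that are not part of the paper: a constant baseline hazard $\lambda_0(t) = $ `baseline_rate`, exponential censoring times, and a user-supplied, hypothetical function `log_risk` playing the role of $\log g$.

```python
import numpy as np

def simulate_cox(n, p, log_risk, baseline_rate=1.0, censor_rate=0.5,
                 a=-1.0, b=1.0, seed=0):
    """Draw (X_i, Z_i, delta_i) from lambda(t|x) = lambda_0 * g(x).

    Assumptions (illustrative): constant baseline hazard and exponential,
    covariate-independent censoring times.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(a, b, size=(n, p))               # bounded covariates in [a, b]^p
    g = np.exp(log_risk(X))                          # relative risk g(x) = exp{f(x)}
    T = rng.exponential(1.0 / (baseline_rate * g))   # survival times with hazard lambda_0 * g(x)
    C = rng.exponential(1.0 / censor_rate, size=n)   # independent censoring times
    Z = np.minimum(T, C)                             # observed event times
    delta = (T <= C).astype(int)                     # 1 if the failure was observed
    return X, Z, delta

# Hypothetical additive log-risk using only the first two covariates.
X, Z, delta = simulate_cox(200, 5, lambda X: np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2)
print("observed failures:", int(delta.sum()), "of", len(delta))
```

The fraction of observed failures printed at the end is the empirical counterpart of the censoring rate whose role in the finite sample bounds is studied below.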

We propose a general methodology for analyzing such complex models: we first identify a local neighborhood where strict convexity can be guaranteed, and then analyze predictive properties in the whole space by chaining through local neighborhoods. The contributions of our paper are threefold.

(1) We establish new general and sparse oracle inequalities (GOI and SOI) for the high dimensional regularized right censored non-parametric Cox model (1). A new type of empirical functional norm is defined and used to describe the predictive performance of the proposed estimator. Classical GOI and SOI results, i.e. those obtained for least squares regression problems, become only local properties in models with complex log-likelihood structure. Locality increases with the dimensionality $p$ of the problem.

(2) We develop a new two-stage technique and extend Le Cam's (Le Cam, 1960) Local Asymptotic Normality (LAN) to high dimensional spaces through something we call a local non-asymptotic approach. We discuss this approach through the Cox model, where we prove sandwich bounds for the log partial likelihood, bounding it from below and from above with quadratic random processes. The first layer is composed of processes that are not necessarily i.i.d. and not necessarily generated by Gaussian linear models. The second layer is developed with the help of the first and is composed of processes that are truly Gaussian in nature. This new technique provides tools for connecting GOI and SOI results obtained for this model and for least squares regression models. Interestingly, we show that this connection holds in a local elliptical neighborhood of $\beta^*$ and does not hold outside of such sets if the dimensionality is high. By localizing our estimator we were able to extend this new idea to problems where $p \gg n$ but $\log p < n$.

(3) We show explicitly how the new non-asymptotic predictive performance of the penalized estimator is connected to the censoring rate of the model. Asymptotic theory is unable to retrieve such connections, because the censoring rate disappears as the sample size increases. In that respect, the results presented in this paper are novel. They provide a direct relationship between finite sample risk properties and the observed number of uncensored data. This relationship is linear in the local neighborhood where local non-asymptotic normality holds. Over the whole space, the relationship is more complicated and is directly embedded in the proposed risk functions.

To handle the high dimensionality of the problem, we introduce a general group penalty where we treat convex and non-convex types of regularization differently. The works most closely related to ours primarily concern sparse oracle inequalities for regularized Gaussian type log-likelihoods (see for example van de Geer (2008); Bickel et al. (2009); Bunea et al. (2009); Jenatton et al. (2011); Gaïffas and Guilloux (2012)), where the quadratic structure of the loss function eases the technical difficulties. The localization technique we developed is complementary to the recent work of Spokoiny (2012), but is based on a different non-asymptotic scheme that greatly relaxes assumptions and simplifies the structure of the bounding processes presented in the latter paper. Additionally, we make connections to the likelihoods that become truly Gaussian in the full observational study, and further extend our method to problems with high dimensional structure.

The paper is organized as follows. In this Section we introduce notation and the model setup and define new empirical functional norms adapted to the setting of censored data. In Section 2 we define the local neighborhood and propose a non-asymptotic extension of the LAN property with two types of sandwich bounds. In Sections 3 and 4 we develop novel oracle inequalities with slow and fast rates of convergence. We use these inequalities to localize our estimator to a small neighborhood where we prove Gaussian type prediction properties. In that way we show that even in high dimensional spaces, when censoring is not severe, Gaussian type prediction properties hold in sparse censored Cox proportional hazards models. They match the Gaussian case, where we observe all data points. In Section 5 we propose a non-convex regularized estimator and further extend the previous finite sample oracle inequalities. Section 6 is left for particular examples.

1.1. Model Formulation. Let us consider the high dimensional, non-parametric Cox hazards regression model, where the hazard rate function $\lambda(t|x)$ takes the form $\lambda(t|x) = \lambda_0(t) g(x)$ with $g \in \mathcal{G}_n$,

(2) $\mathcal{G}_n = \{ g : [a,b]^p \to \mathbb{R}^+ : \log g(x) = f(x), f \in \mathcal{F}_n \}$,

where $\mathcal{F}_n$ is the collection of functions $f$ on $[a,b]^p$ with the additive structure $f(x) = f_1(x_1) + \cdots + f_p(x_p)$. The goal is to estimate the unknown functions $f_j$ using the best linear combination of a finite dictionary of candidate functions $\Psi_1(x), \dots, \Psi_d(x)$. The constraints on the candidate functions $\Psi_k$ are that they are known a priori and bounded above by a constant $C$. Note that we can always center the data so that they are mean zero. In this way we want to design a statistical learning procedure that adapts to the unknown functions $f_j$, for each of which we only assume that it belongs to a linear space spanned by our dictionary (see for example Bunea et al. (2007), Rigollet (2012)). In standard notation let us define $N_i(t) = 1\{Z_i \le t, \delta_i = 1\}$, $N(t) = \sum_{i=1}^n N_i(t)$, and $Y_i(t) = 1\{Z_i \ge t\}$.
We may then write the likelihood as

(3) $\prod_{i=1}^n \prod_{t \ge 0} \big[\lambda_0(t) \exp\{b^T \Psi(X_i)\}\big]^{\Delta N_i(t)} \big[1 - \lambda_0(t) S_n^{(0)}(b,t)\big]^{1 - \Delta N(t)}$,



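where we take the following vector notation: $b = [b_1^T | b_2^T | \cdots | b_p^T]^T$ with $b_j = (b_{j1}, \dots, b_{jd})^T$, and $\Psi(X_i) = [\Psi^T(X_{i1}) | \cdots | \Psi^T(X_{ip})]^T$ with $\Psi(X_{ij}) = (\Psi_1(X_{ij}), \dots, \Psi_d(X_{ij}))^T$, and for $l = 0, 1, 2$, with $\otimes$ denoting the outer product,

$S_n^{(l)}(b,t) = \frac{1}{n} \sum_{i=1}^n Y_i(t) \Psi^{\otimes l}(X_i) \exp\{b^T \Psi(X_i)\}$.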



The Nelson-Aalen estimator of the integrated underlying hazard function, $\hat\Lambda_0(t, b) = \int_0^t \{S_n^{(0)}(b,s)\}^{-1}\, dN(s)$, becomes a Breslow type of estimator, which allows us to write the normalized log partial likelihood as

(4) $\mathcal{L}_n(b,\tau) = \frac{1}{n} \sum_{i=1}^n \int_0^\tau b^T \Psi(X_i)\, dN_i(t) - \int_0^\tau \log S_n^{(0)}(b,t)\, dN(t)$,

where here and hereafter $\tau$ is the fixed study end time.
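
As a computational illustration of (4), the sketch below evaluates a log partial likelihood from a design matrix whose rows are $\Psi(X_i)$. It is not the authors' implementation. It assumes no tied failure times, takes the study end time beyond all observations, and, as a labeled normalization choice, scales both terms by $1/n$ (the convention under which the score formula of Section 2.1 is the exact gradient). The function name `log_partial_likelihood` is hypothetical.

```python
import numpy as np

def log_partial_likelihood(b, Psi, Z, delta):
    """Log partial likelihood with both terms scaled by 1/n (an assumption).

    Psi   : (n, p*d) array whose i-th row is Psi(X_i)
    Z     : observed times min(T_i, C_i)
    delta : 1 for an observed failure, 0 for a censored observation
    """
    n = len(Z)
    eta = Psi @ b                               # b^T Psi(X_i) for every subject
    total = 0.0
    for i in np.flatnonzero(delta):             # each observed failure is a jump of N_i
        at_risk = Z >= Z[i]                     # Y_l(t) = 1{Z_l >= t} at t = Z_i
        S0 = np.exp(eta[at_risk]).sum() / n     # S_n^{(0)}(b, t)
        total += eta[i] - np.log(S0)
    return total / n

# Tiny example: three uncensored subjects and b = 0.
Psi = np.zeros((3, 2))
value = log_partial_likelihood(np.zeros(2), Psi, np.array([1.0, 2.0, 3.0]), np.array([1, 1, 1]))
print(value)   # equals log(4.5) / 3, since S0 takes the values 1, 2/3 and 1/3
```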

1.2. Family of Folded Group Penalties. Let us define the empirical risk function $\mathcal{R}_n(b)$ as the negative of the log partial likelihood,

(5) $\mathcal{R}_n(b) = -\mathcal{L}_n(b)$,

where from now on, for every fixed time window $\tau$, $\mathcal{L}_n(b, \tau) \equiv \mathcal{L}_n(b)$. Since the maximization of the log-likelihood is an overparametrized problem for $p \gg n$, we introduce a penalty structure $P(b)$ and hence search for the estimator $\hat\beta$ of the unknown parameter as a solution of the following problem:

(6) $\min_{b \in \mathbb{R}^{pd}} \{\mathcal{R}_n(b) + \lambda_n P(b)\}$,

where the penalty function $P(b)$ is defined as

(7) $P(b) = \sum_{j=1}^p df_j \cdot \rho(\|b_j\|_{\gamma_j}) = \sum_{j=1}^p df_j \cdot \rho\Big(\big(\sum_{k=1}^d |b_{jk}|^{\gamma_j}\big)^{1/\gamma_j}\Big)$.

The parameters $df_j$ are the group scalings corresponding to the degrees of freedom of each group. The typical choice of $df_j$, for convex functions $\rho$, is $d^{1/\gamma_j^*}$, for $\gamma_j^*$ being the Hölder conjugate of $\gamma_j$, to ensure that the penalty term is of the order of the number of parameters $df_j$.
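
The penalty (7) can be sketched in a few lines. The snippet below is an illustration under simplifying assumptions that are ours rather than the paper's: a common exponent `gamma` for all groups, the identity as the default `rho`, and the group scaling $df_j = d^{1/\gamma^*}$ described above. The function name `folded_group_penalty` is hypothetical.

```python
import numpy as np

def folded_group_penalty(b, p, d, gamma=2.0, rho=lambda u: u):
    """P(b) = sum_j df_j * rho(||b_j||_gamma) with df_j = d**(1/gamma*).

    Assumes one common gamma >= 1 for all groups; gamma* is its Hoelder conjugate.
    """
    B = np.asarray(b, dtype=float).reshape(p, d)       # row j holds the group b_j
    if np.isinf(gamma):
        norms = np.abs(B).max(axis=1)                  # sup-norm of each group
        df = float(d)                                  # conjugate of infinity is 1
    else:
        norms = (np.abs(B) ** gamma).sum(axis=1) ** (1.0 / gamma)
        gamma_star = np.inf if gamma == 1.0 else gamma / (gamma - 1.0)
        df = d ** (1.0 / gamma_star)                   # equals 1 when gamma = 1
    return float(np.sum(df * rho(norms)))

b = [3.0, 4.0, 0.0, 0.0]                               # p = 2 groups of size d = 2
print(folded_group_penalty(b, p=2, d=2, gamma=2.0))    # group-Lasso type: sqrt(2) * 5
print(folded_group_penalty(b, p=2, d=2, gamma=1.0))    # plain Lasso type: 7
print(folded_group_penalty(b, p=2, d=2, gamma=np.inf)) # block sup-norm type: 2 * 4
```

Passing a concave `rho` (for example a hypothetical `lambda u: np.log1p(u)`) gives a non-convex member of the family without changing the grouping structure.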

The additive model of the hazard function $\lambda$ in (2) induces a natural grouping structure over the parameter space $\{b_{jk}\}_{j,k=1}^{p,d}$ at the functional level. When regularizing by the Lasso penalty, sparsity is induced by treating each variable individually, and existing relationships and structures between the variables (spatial, hierarchical or related to the physics of the problem) are simply disregarded (see Zhao et al. (2009); Jenatton et al. (2011)). Hence, a penalty that encourages sparsity across functional groups is more suitable.
The family of folded group penalty functions (FGP) (7) takes into account existing structure in the parameter space and encourages sparsity in the additive model. In its full generality, the family of FGP penalties can induce a wide variety of grouping structures in the coefficients $\{b_{jk}\}_{j,k=1}^{p,d}$: $\rho$ determines how the groups relate to one another, while the $\{L_{\gamma_j}\}_{j=1}^p$ norms dictate the relationship of the coefficients within each group $j$. In this way, the number of groups is minimized, whereas the variables in each group are selected as a whole block. This follows easily from the well known properties of bridge estimators for $\gamma_j > 1$, where $\gamma_j$ determines how close together the sizes of the coefficients in each selected group are kept.

In that respect, for specific functions $\rho$ and parameters $\gamma_j$, FGP (7) reduces to a number of already proposed group penalties. For example, for $\rho = L_1$ and any $\gamma_j$ it reduces to the CAP family $\lambda \sum_{j=1}^p \|f_j\|_{\gamma_j}$ of Zhao et al. (2009); for $\rho = L_1$, $\gamma_j = 2$ it becomes the group Lasso penalty $\lambda \sum_{j=1}^p \|f_j\|_2$ of Yuan and Lin (2006); Obozinski et al. (2010); Lounici et al. (2011); for $\rho = L_1$ and $\gamma_j = \infty$ it reduces to the block $\ell_1/\ell_\infty$ penalty $\lambda \sum_{j=1}^p \|f_j\|_\infty$ of Zhang et al. (2008) and Negahban and Wainwright (2011). Moreover, the problem can be reparametrized to include a scaling in the penalty function of the following form: $\rho(\|b_j R_j\|_{\gamma_j})$ with $R_j = (R_1(X_j) | \cdots | R_d(X_j))^T$, $R_k(X_j) = (R_k(X_{1j}), \dots, R_k(X_{nj}))^T$,

for some weighting functions $\{R_k\}_{k=1}^d : \mathbb{R} \to \mathbb{R}$. A similar scaling was used in the structured group Lasso of van de Geer (2011), where it was combined with the plain Lasso estimator to obtain sparse solutions.
Moreover, as smoothing is necessary for the B-spline basis expansion Meier et al. (2009), the penalty 

in (7) can be easily adapted to include such scaling as p ^||Rjbj||-y^ + y^bjMjbj^ where the dx d 

smoothing matrix {Mj}/j; = J 'I'j.(xj)'I'; {xj)dxj. All the results in the paper can be extended to 
hold for both such extensions (see Section 6), but for simplicity of notation we will be working with 
the penalty function as defined in (7). 

1.3. Notation. Let $b$ stand for the $p \times d$ vector of parameters $\{b_{jk}\}_{j,k=1}^{p,d}$ such that $f_b(x) = \sum_{j=1}^p \sum_{k=1}^d b_{jk} \Psi_k(x_j)$. Let $f_\beta(x)$ now denote the $p$-dimensional additive approximation of the unknown function $f(x)$ with the following structure:

(8) $f^*(x) = f_\beta(x) = \sum_{j=1}^p \sum_{k=1}^d \beta_{jk} \Psi_k(x_j)$,

where the unknown parameter vector is $\beta = [\beta_1^T | \cdots | \beta_p^T]^T$ and $\beta_j = (\beta_{j1}, \dots, \beta_{jd})^T$. With this we have moved away from the additive model (2) to the fully nonparametric model with $\lambda(t|x) = \lambda_0(t) \exp\{f(x)\}$. To avoid the curse of dimensionality (commonly recognized in nonparametric problems), the risk function in (5) serves only as a proxy for the fully nonparametric Cox negative partial likelihood. Naturally, by allowing $p \gg n$ we are able to achieve a better approximation to the maximum of the fully nonparametric partial log likelihood. With this notation we have

$f^*(X_i) = f_\beta(X_i) = \sum_{j=1}^p \sum_{k=1}^d \beta_{jk} \Psi_k(X_{ij}) = \beta^T \Psi(X_i)$,

for a $p \cdot d$ dimensional vector $\Psi(X_i)$ as described in the description of equation (9). In that sense, let $\beta^*$ denote the sparse alternative to $\beta$, which corresponds to $f_{\beta^*}(x)$ being the sparse equivalent of the additive function $f_\beta(x)$, i.e. the sparse additive approximation of the unknown function $f(x)$ (see Meier et al. (2009)). This sparse approximation, $f_{\beta^*}(x)$, is equal to $\sum_{j=1}^s \sum_{k=1}^d \beta^*_{jk} \Psi_k(x_j)$, i.e. $f_{\beta^*}(X_i) = \beta^{*T} \Psi(X_i)$ with $\beta^* = [\beta_1^{*T} | \beta_2^{*T} | \cdots | \beta_s^{*T} | 0^T | \cdots | 0^T]^T$, $\beta_j^* = (\beta^*_{j1}, \dots, \beta^*_{jd})^T$, and $0 = (0, \dots, 0)^T \in \mathbb{R}^d$.
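
The stacked vector $\Psi(X_i)$ and the identity $f_\beta(X_i) = \beta^T \Psi(X_i)$ can be illustrated with a small helper that builds the $n \times pd$ design matrix from a dictionary. This is a sketch under assumptions of our own: the polynomial dictionary is purely hypothetical, and columns are centered to mimic the mean-zero convention mentioned in Section 1.1.

```python
import numpy as np

def design_matrix(X, dictionary):
    """Stack Psi_k(X_ij) into an (n, p*d) matrix; group j occupies columns j*d ... (j+1)*d - 1."""
    n, p = X.shape
    blocks = [np.column_stack([psi(X[:, j]) for psi in dictionary]) for j in range(p)]
    Psi = np.hstack(blocks)
    return Psi - Psi.mean(axis=0)          # center every dictionary column

# Hypothetical dictionary with d = 3 candidate functions.
dictionary = [lambda x: x, lambda x: x ** 2, lambda x: x ** 3]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 4))   # n = 50 subjects, p = 4 covariates
Psi = design_matrix(X, dictionary)

beta = np.zeros(4 * 3)
beta[:3] = [1.0, -0.5, 0.25]               # sparse beta*: only the first group is active
f_beta = Psi @ beta                        # f_beta(X_i) = beta^T Psi(X_i) for all i
print(Psi.shape, f_beta.shape)
```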

Let us define an empirical functional norm $\|\cdot\|_{n,b^*}$ of all functions $f_b : \mathbb{R}^p \to \mathbb{R}$ for any $b \in \mathbb{R}^{pd}$ and a fixed $b^* \in \mathbb{R}^{pd}$ as

$\|f_b\|^2_{n,b^*} = \frac{1}{n} \sum_{i=1}^n \int_0^\tau Y_i(t) \omega_i(b^*,t) f_b^2(X_i)\, dN(t) - \Big[\frac{1}{n} \sum_{i=1}^n \int_0^\tau Y_i(t) \omega_i(b^*,t) f_b(X_i)\, dN(t)\Big]^2$,

for the nonnegative weight process

$\omega_i(b^*, t) = \exp\{f_{b^*}(X_i)\} / S_n^{(0)}(b^*, t)$.

The intuition behind the introduction of this norm is given in Section 2.1. It will be crucial in analyzing fine non-asymptotic properties of the penalized estimator $\hat\beta$ and is connected to the curvature of the log-likelihood process. Because no two counting processes $N_i(t)$ jump at the same time, the following holds:

$\|f_b\|^2_{n,b^*} = \frac{1}{n} \sum_{i=1}^n \int_0^\tau Y_i(t) \omega_i(b^*,t) \big(f_b(X_i) - \bar f_b(t)\big)^2\, dN(t)$,

where $\bar f_b(t) = \frac{1}{n} \sum_{i=1}^n Y_i(t) \omega_i(b^*,t) f_b(X_i)$ can be understood as a process of empirical weighted averages of $f_b$. Notice that, if Condition 1 is satisfied (see Section 2.1), the introduced empirical norm exhibits all the properties of a proper norm. It is nonnegative, $\|f_b\|_{n,b^*} = 0$ for every $b^*$ if and only if $b = 0$ (see Condition 1(iii) in Section 2.1), and it satisfies the triangle inequality $\|f_{b_1} + f_{b_2}\|_{n,b^*} \le \|f_{b_1}\|_{n,b^*} + \|f_{b_2}\|_{n,b^*}$ for every $b_1, b_2$ and any fixed $b^*$ (by simple algebraic manipulations).

Let us denote with $t_1 < \cdots < t_N$ the ordered failure times and with $\mathcal{R}_q = \{i \in \{1, \dots, n\} : Z_i \ge t_q\}$ the risk sets. Then we can define an $\ell_2$ empirical functional norm $\|\cdot\|_n$ for the case of censored data as follows:

(10) $\|f_b\|_n^2 = \frac{1}{n} \sum_{i=1}^n 1\{i \in I\} f_b^2(X_i)$,

where $I = \{i \in \{1, \dots, n\} : i \in \cup_{q=1}^N \mathcal{R}_q\}$. If we are willing to assume that $I = \{1, \dots, n\}$ (as it seems natural to consider only patients that belong to at least one risk set), the previous definition (10) matches the classical functional $\ell_2$ norm of the form $\frac{1}{n} \sum_{i=1}^n f_b^2(X_i)$.
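
The censored norm (10) only differs from the usual empirical $\ell_2$ norm through the index set $I$. A minimal sketch, assuming the arrays `Z` and `delta` follow the notation above (the function name `censored_l2_norm_sq` is hypothetical): since the risk sets are nested, $I$ is simply the first risk set, i.e. all subjects still under observation at the earliest failure time.

```python
import numpy as np

def censored_l2_norm_sq(f_vals, Z, delta):
    """||f_b||_n^2 from (10): average of f_b(X_i)^2 over subjects in at least one risk set."""
    n = len(Z)
    failure_times = Z[delta == 1]
    if failure_times.size == 0:                      # no failures observed: I is empty
        return 0.0
    in_some_risk_set = Z >= failure_times.min()      # I equals the first (largest) risk set
    return float(np.sum(f_vals[in_some_risk_set] ** 2) / n)

Z = np.array([1.0, 2.0, 3.0, 4.0])
f_vals = np.ones(4)
# Subject 0 is censored before the first failure, so it belongs to no risk set.
print(censored_l2_norm_sq(f_vals, Z, np.array([0, 1, 1, 0])))   # 0.75
# If the earliest observation is a failure, I = {1, ..., n} and the classical norm is recovered.
print(censored_l2_norm_sq(f_vals, Z, np.array([1, 0, 1, 0])))   # 1.0
```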

2. Local Non-Asymptotic Bounds 

This section expands the ideas of Le Cam's local asymptotic normality (LAN) in two directions. First, it develops a non-asymptotic equivalent of LAN techniques where, instead of proving convergence to a single Gaussian event, we show finite sample Gaussian sandwich bounds. Finite sample techniques are important for modern high throughput data, where the dimensionality of the parameter space prevents us from analyzing methods asymptotically. In high dimensional problems, it is not clear whether local perturbations of rate $1/\sqrt{n}$ are the optimal ones: our intuition suggests that the rates of convergence might be of the order of $\sqrt{\log p / n}$ when $p \gg n$ but $\log p < n$, and a trivial extension of the classical asymptotic approach fails due to vectors and matrices of exploding size. Hence, we develop a non-asymptotic extension of it that allows us to define finite sample sandwich bounds on the likelihood process.

We show that the dimensionality $p$ plays a crucial role even in local bounds and that local neighborhoods of Le Cam type are inherently dimensionality dependent in cases where the model structure is complex and where $p \gg n$ (see Section 3 for more details). Furthermore, we take these local bounds and combine them with sparse oracle inequalities to show that the penalized estimator can have predictive properties close to those of the oracle estimator in Gaussian linear models, i.e. simple linear models with Gaussian errors. In this sense, we extend the LAN idea to overparametrized problems with $p \gg n$ (see Section 4 for more details).

2.1. Local Non-Asymptotic Quadraticity. In this section we show an important quadratic interpretation of the risk function $\mathcal{R}_n(b)$. We work with the partial log-likelihood (4), but note that this representation of $\mathcal{R}_n(b)$ works for many complex log-likelihood and log partial likelihood problems and all the ideas can be easily generalized to specific cases. Note that, without loss of generality, $\mathcal{R}_n(b)$ can be written as $\mathcal{R}_n(b) = -\mathcal{L}_n(b) + \mathcal{L}_n(\beta^*) - \mathcal{L}_n(\beta^*)$. By Taylor expansion around $\beta^*$, there exists a $c \in (0, 1)$ and $b^* = cb + (1 - c)\beta^*$ such that

$\mathcal{R}_n(b) = -(b - \beta^*)^T \{\nabla \mathcal{L}_n(\beta^*)\} - \frac{1}{2} (b - \beta^*)^T \{\nabla^2 \mathcal{L}_n(b^*)\} (b - \beta^*) - \mathcal{L}_n(\beta^*)$.

Let us introduce the following notation: $E_n(b,t) = S_n^{(1)}(b,t) / S_n^{(0)}(b,t)$ and $V_n(b,t) = S_n^{(2)}(b,t) / S_n^{(0)}(b,t) - \big(S_n^{(1)}(b,t) / S_n^{(0)}(b,t)\big)^{\otimes 2}$. With this notation at hand, the score vector and the Hessian of the log partial likelihood have the following representations, respectively:

$-\{\nabla \mathcal{L}_n(b)\} = n^{-1} \sum_{i=1}^n \int_0^\tau (E_n(b,t) - \Psi(X_i))\, dN_i(t)$,

$-\{\nabla^2 \mathcal{L}_n(b)\} = n^{-1} \sum_{i=1}^n \int_0^\tau V_n(b,t)\, dN_i(t)$.
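
The score representation above can be checked numerically. The sketch below is an illustration rather than the paper's code: it assumes no tied failure times and uses the log partial likelihood with both terms scaled by $1/n$ (our normalization assumption, under which the displayed score is the exact negative gradient). The names `log_pl` and `neg_score` are hypothetical.

```python
import numpy as np

def log_pl(b, Psi, Z, delta):
    """Log partial likelihood with both terms scaled by 1/n (normalization assumed here)."""
    n = len(Z)
    eta = Psi @ b
    events = np.flatnonzero(delta)
    risk = Z[None, :] >= Z[events, None]          # risk[q, l] = Y_l(t_q)
    S0 = (risk * np.exp(eta)).sum(axis=1) / n     # S_n^{(0)}(b, t_q)
    return (eta[events].sum() - np.log(S0).sum()) / n

def neg_score(b, Psi, Z, delta):
    """-grad L_n(b) = n^{-1} sum_i int (E_n(b,t) - Psi(X_i)) dN_i(t)."""
    n = len(Z)
    w = np.exp(Psi @ b)
    events = np.flatnonzero(delta)
    risk = Z[None, :] >= Z[events, None]
    S0 = (risk * w).sum(axis=1) / n               # shape (N,)
    S1 = (risk * w) @ Psi / n                     # shape (N, p*d)
    E = S1 / S0[:, None]                          # E_n(b, t_q) at each failure time
    return (E - Psi[events]).sum(axis=0) / n

# Compare with a central finite-difference gradient on simulated inputs.
rng = np.random.default_rng(1)
Psi = rng.normal(size=(30, 3))
Z = rng.exponential(size=30)
delta = (rng.uniform(size=30) < 0.7).astype(int)
b = 0.5 * rng.normal(size=3)
h = 1e-6
num_grad = np.array([(log_pl(b + h * e, Psi, Z, delta) - log_pl(b - h * e, Psi, Z, delta)) / (2 * h)
                     for e in np.eye(3)])
print(np.max(np.abs(num_grad + neg_score(b, Psi, Z, delta))))   # close to zero
```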

With the notation introduced in the previous section and by simple algebraic manipulations, the following quadratic representation of the Hessian matrix holds:

$-b^T \{\nabla^2 \mathcal{L}_n(b^*)\} b = \|f_b\|^2_{n,b^*}$.

Together with the previous Taylor expansion, we have that the empirical risk function decomposes as follows:

$\mathcal{R}_n(b) = n^{-1} \sum_{i=1}^n \int_0^\tau (b - \beta^*)^T (E_n(\beta^*,t) - \Psi(X_i))\, dN_i(t) + \frac{1}{2} n^{-1} \sum_{i=1}^n \int_0^\tau (b - \beta^*)^T V_n(b^*,t) (b - \beta^*)\, dN_i(t) - \mathcal{L}_n(\beta^*)$.

That is, for every $b$ there exists a $c \in (0, 1)$ and $b^* = cb + (1 - c)\beta^*$ such that $\mathcal{R}_n(b)$ admits the following quadratic representation:

(11) $\mathcal{R}_n(b) = -(b - \beta^*)^T h_n(\beta^*) + \frac{1}{2} \|f_b - f_{\beta^*}\|^2_{n,b^*} - \mathcal{L}_n(\beta^*)$,

where we have the concatenated score vector $h_n(\beta^*) = \{\nabla \mathcal{L}_n(\beta^*)\} \in \mathbb{R}^{pd}$, which by the score representation above is given by

(12) $h_n(\beta^*) = n^{-1} \sum_{i=1}^n \int_0^\tau (\Psi(X_i) - E_n(\beta^*,t))\, dN_i(t)$,

with $h_n(\beta^*) = [h_{n,1}^T(\beta^*) | h_{n,2}^T(\beta^*) | \cdots | h_{n,p}^T(\beta^*)]^T$, where each $h_{n,j}(\beta^*)$ is a $d$-dimensional vector

$h_{n,j}(\beta^*) = (\{h_n(\beta^*)\}_{j1}, \{h_n(\beta^*)\}_{j2}, \dots, \{h_n(\beta^*)\}_{jd})^T$.
To examine the properties of the quadratic representation of the risk function $\mathcal{R}_n(b)$ and its connection to a quadratic form that does not involve the parameter $b^*$, we define the process

(13) $\mathcal{G}_{n,\eta}(b) = -(b - \beta^*)^T h_n(\beta^*) + \frac{\eta}{2} \|f_b - f_{\beta^*}\|^2_{n,\beta^*} - \mathcal{L}_n(\beta^*)$,

which is a natural extension of the right hand side of equation (11). We look at how well $\mathcal{G}_{n,\eta}(b)$ can approximate the risk function $\mathcal{R}_n(b)$ for any vector $b$. This approximation also tells us how the Hessian matrix $V_n(b^*)$ is related to $V_n(\beta^*)$, which is not a problem for linear regression with quadratic empirical risk $\mathcal{R}_n(b)$, as there $V_n(b^*) = V_n(\beta^*)$ for all $b^*$. The key to this approximation is the following proposition.

Proposition 1. For any vector $v$, let $a_v = \max_{1 \le i,j \le n} |v^T [\Psi(X_i) - \Psi(X_j)]|$. Then the following sandwich bound holds almost surely for every vector $b$ and corresponding vector $b^* = cb + (1 - c)\beta^*$:

(14) $e^{-2a_{b-\beta^*}} \|f_b - f_{\beta^*}\|^2_{n,\beta^*} \le \|f_b - f_{\beta^*}\|^2_{n,b^*} \le e^{2a_{b-\beta^*}} \|f_b - f_{\beta^*}\|^2_{n,\beta^*}$.

Proof of Proposition 1. To see that equation (14) is correct, we adopt the following reasoning. First, note that $\|f_b - f_{\beta^*}\|^2_{n,b^*}$ is equal to
Z.»,,= l ^ q\ ^ ^^^^^^ 



with a, = (b-/3*)^(*(X,) -E„(/3*,t)) and = r,(t) exp{/3*^*(X,)} and c (l-c)(max,a» + 
mini Oi)/2. If we let = ab-/3* we can see that max^ |(1 — c)ai — c| < r//2. Using this notation we 
can see that from Q(^-<^)°-i-'^ > g-'n/'^ and e(^~'=)°'~'^ < e''/^ we have 

Il/b - //3*L.b* > cxp{-27?}n / "^^^ dN{t) 



= exp{-27?} II /b - 11,1/3- • 
Upper bound follows the same reasoning and is therefore omitted. □ 
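
A quick numerical sanity check of the sandwich bound (14) is sketched below. It is an illustration on simulated inputs, not a substitute for the proof. It uses the quadratic representation of Section 2.1 to evaluate the norm as $v^T\{-\nabla^2 \mathcal{L}_n(b)\}v$ with $v = b - \beta^*$, assumes no tied failure times, and scales by $1/n$ (our normalization assumption; the bound is scale-free). The function names are hypothetical.

```python
import numpy as np

def hessian_quadratic_form(v, b, Psi, Z, delta):
    """v^T {-grad^2 L_n(b)} v: averaged weighted variance of v^T Psi(X_l) over the risk sets."""
    n = len(Z)
    w = np.exp(Psi @ b)
    u = Psi @ v
    events = np.flatnonzero(delta)
    risk = Z[None, :] >= Z[events, None]
    W = risk * w
    W = W / W.sum(axis=1, keepdims=True)          # risk-set probabilities at each failure time
    mean = W @ u
    var = W @ u ** 2 - mean ** 2                  # v^T V_n(b, t_q) v for each failure time
    return var.sum() / n

def a_v(v, Psi):
    """a_v = max_{i,j} |v^T (Psi(X_i) - Psi(X_j))|, i.e. the range of v^T Psi(X_i)."""
    u = Psi @ v
    return u.max() - u.min()

rng = np.random.default_rng(2)
Psi = rng.normal(size=(60, 4))
Z = rng.exponential(size=60)
delta = (rng.uniform(size=60) < 0.7).astype(int)
beta_star = np.array([1.0, -1.0, 0.0, 0.0])
b = beta_star + 0.3 * rng.normal(size=4)
v = b - beta_star

q0 = hessian_quadratic_form(v, beta_star, Psi, Z, delta)         # norm at beta*
a = a_v(v, Psi)
for c in (0.1, 0.5, 0.9):                                        # b* = c b + (1 - c) beta*
    q_star = hessian_quadratic_form(v, beta_star + c * v, Psi, Z, delta)
    print(c, np.exp(-2 * a) * q0 <= q_star <= np.exp(2 * a) * q0)
```

In experiments of this kind the bound is typically far from tight for moderate $a_{b-\beta^*}$, which is consistent with its role as a worst-case, almost sure statement.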



From this proposition it follows that the process $\mathcal{R}_n(b)$ is almost surely upper bounded by the process $\mathcal{G}_{n,\eta}$ for $\log \eta = 2a_{b-\beta^*}$ and lower bounded by the process $\mathcal{G}_{n,\eta}$ for $\log \eta = -2a_{b-\beta^*}$. The original stochastic approximation problem is now sandwiched between two quadratic stochastic approximation problems of the following kind:

$\mathcal{G}_{n,e^{-2a_{b-\beta^*}}}(b) + \lambda_n P(b) \le \mathcal{R}_n(b) + \lambda_n P(b) \le \mathcal{G}_{n,e^{2a_{b-\beta^*}}}(b) + \lambda_n P(b)$.

Moreover, studying the properties of $\mathcal{R}_n(\hat\beta)$ can be translated into studying the properties of the two quadratic processes $\mathcal{G}_{n,e^{-2a_{\hat\beta-\beta^*}}}(\hat\beta)$ and $\mathcal{G}_{n,e^{2a_{\hat\beta-\beta^*}}}(\hat\beta)$. The definition (13) indicates that the processes $\mathcal{G}_{n,\eta}$ have a geometric structure similar to the log-likelihood of a Gaussian model. However, they are not necessarily equal in distribution to a Gaussian process, since the norm $\|\cdot\|_{n,\beta^*}$ incorporates data which did not come from a Gaussian model. It turns out that their geometric structure is of more importance than their distributional qualities.
2.2. Local Non-Asymptotic Normality. The previously defined processes, although having quadratic geometric structure, are not Gaussian distributed. In the case of a simple linear model with Gaussian errors, they become equal to $\mathcal{G}_{n,1}$, which with high probability follows a Gaussian distribution. However, in the case of the Cox proportional hazards model, due to the complicated $\eta$ term, they do not follow a Gaussian distribution. In order to better understand their relationship to Gaussian processes and to the dimension $p$, we turn to the relationship between the introduced empirical functional norms $\|\cdot\|_{n,b^*}$ and $\|\cdot\|_n$.

With slight abuse of notation, note that the weight process $\omega_i(b,t)$ and the empirical norm $\|f_b\|_n$ are tightly connected to the following weighting vector $\omega_i(b)$, expressed through the sequence of risk sets $\{\mathcal{R}_q\}_{q=1}^N$ as follows:

(15) $\omega_i(b) = \sum_{q=1}^N \frac{1\{i \in \mathcal{R}_q\} \exp\{b^T \Psi(X_i)\}}{\sum_{l \in \mathcal{R}_q} \exp\{b^T \Psi(X_l)\}}$.

As sums of conditional probabilities that observation $i$ had an event at time $t_q$, given that at least one event occurred at time $t_q$, the weights are necessarily nonnegative. This representation will be useful for the characterization of Gaussian bounds on the log partial likelihood process in the following Proposition.

Proposition 2. Let $N$ represent the number of distinct events. Then the uniform sandwich bound for the norm $\|f_b\|_{n,b^*}$, for every $b$ and corresponding $b^* = cb + (1 - c)\beta^*$ with $c \in (0, 1)$, holds almost surely:

(16) $\underline{\omega} \|f_b\|_n^2 \le \|f_b\|^2_{n,b^*} \le N \|f_b\|_n^2$,

where $\|f_b\|_n$ denotes the censored $\ell_2$ empirical norm as defined in (10) and where

$\underline{\omega} = \min \{\omega_i(\beta^* + c(b - \beta^*)) : b \in \mathbb{R}^{pd},\ c \in (0,1),\ i \in \{1, \dots, n\},\ i \in \cup_{q=1}^N \mathcal{R}_q\}$,

with the sequence of risk sets $\{\mathcal{R}_q\}_{q=1}^N$ as defined in Section 1 and $\omega_i(b)$ defined in (15).
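
The weights $\omega_i(b)$ of (15) are straightforward to compute once the risk sets are formed. The following sketch (an illustration assuming no tied failure times; the function name `omega_weights` is hypothetical) also checks two facts used in the proof below: the weights are nonnegative, and they sum to the number of observed events $N$, because each risk set contributes a probability vector.

```python
import numpy as np

def omega_weights(b, Psi, Z, delta):
    """omega_i(b) = sum_q 1{i in R_q} exp{b^T Psi(X_i)} / sum_{l in R_q} exp{b^T Psi(X_l)}."""
    w = np.exp(Psi @ b)
    events = np.flatnonzero(delta)
    risk = Z[None, :] >= Z[events, None]                      # risk[q, i] = 1{i in R_q}
    probs = risk * w / (risk * w).sum(axis=1, keepdims=True)  # one probability vector per failure
    return probs.sum(axis=0)                                  # one weight per subject

rng = np.random.default_rng(4)
Psi = rng.normal(size=(40, 3))
Z = rng.exponential(size=40)
delta = (rng.uniform(size=40) < 0.6).astype(int)
omega = omega_weights(np.array([0.5, -0.5, 0.0]), Psi, Z, delta)

N = int(delta.sum())
print(np.isclose(omega.sum(), N), omega.min() >= 0.0, omega.max() <= N)
```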

Proof of Proposition 2. Let $N$ stand for the cardinality of the set $\{i = 1, \dots, n : N_i(\tau) = 1\}$. Note that the weight process $\omega_i(b,t)$ satisfies the following normalization uniformly over $b$ and $t$:

$\frac{1}{n} \sum_{i=1}^n Y_i(t) \omega_i(b,t) = 1$.

Moreover, for each $b$ there exists at least one $i \in \{1, \dots, n\}$ such that $\omega_i(b,t) > 0$, and for all $i$ for which $\exists t \in [0, \tau]$ with $Y_i(t) = 1$ we have that $\omega_i(b,t) \le n$.¹ Let us denote with

$\omega_i(b) = \int_0^\tau Y_i(t) \omega_i(b,t)\, dN(t)$.

If $t_1 < \cdots < t_N$ are the ordered failure times and $\mathcal{R}_j = \{i \in \{1, \dots, n\} : Z_i \ge t_j\}$ is the risk set at $t_j$, then $\omega_i(b)$ has the following representation:

$\omega_i(b) = \sum_{j=1}^N \frac{1\{i \in \mathcal{R}_j\} \exp\{b^T \Psi(X_i)\}}{\sum_{l \in \mathcal{R}_j} \exp\{b^T \Psi(X_l)\}}$.

¹ Assume that there exists at least one $i$ such that $\omega_i(b,t) > n$ and $Y_i(t) = 1$. Then $1 < \frac{1}{n} Y_i(t) \omega_i(b,t) + \frac{1}{n} \sum_{l \ne i} Y_l(t) \omega_l(b,t) = \frac{1}{n} \sum_{l=1}^n Y_l(t) \omega_l(b,t) = 1$, which is a contradiction.
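Firstly, note that $\omega_i \ge 0$ for all $i$ and $\omega_i > 0$ for $i \in I = \{i \in \{1, \dots, n\} : i \in \cup_{j=1}^N \mathcal{R}_j\}$. Using the previous notation we have

$\|f_b\|^2_{n,b^*} = \frac{1}{n} \sum_{i=1}^n f_b^2(X_i) \omega_i(b^*) - \Big(\frac{1}{n} \sum_{i=1}^n f_b(X_i) \omega_i(b^*)\Big)^2$.

With this notation at hand we have that

$\sum_{i=1}^n \omega_i(b) = N$,

and we are able to conclude $N \ge \bar\omega = \max\{\omega_i(b) : i \in I, b \in \mathbb{R}^{pd}\} \ge N/|I| \ge \underline{\omega} = \min\{\omega_i(b) : i \in I, b \in \mathbb{R}^{pd}\}$ for $I = \{i \in \{1, \dots, n\} : i \in \cup_{j=1}^N \mathcal{R}_j\}$. Hence,

$\|f_b\|^2_{n,b^*} \le \bar\omega \|f_b\|_n^2 - \Big(\frac{1}{n} \sum_{i=1}^n f_b(X_i) \omega_i(b^*)\Big)^2 \le N \|f_b\|_n^2$.

To obtain the left hand side of (16), remember that from the previous exposition we have

$\|f_b\|^2_{n,b^*} = \frac{1}{n} \sum_{i=1}^n \omega_i(b^*) \big(f_b(X_i) - \bar f_b\big)^2$,

with $\bar f_b = \frac{1}{n} \sum_{i=1}^n \omega_i(b^*) f_b(X_i)$. Hence, by centering the data so that the sample mean is equal to zero, that is $\frac{1}{n} \sum_{i=1}^n f_b(X_i) = 0$, we have

$\|f_b\|^2_{n,b^*} \ge \underline{\omega} \frac{1}{n} \sum_{i=1}^n \big(f_b^2(X_i) + \bar f_b^2\big) - 2 \underline{\omega} \bar f_b \Big(\frac{1}{n} \sum_{i=1}^n f_b(X_i)\Big) \ge \frac{\underline{\omega}}{n} \sum_{i=1}^n f_b^2(X_i) = \underline{\omega} \|f_b\|_n^2$.

□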

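The lower bound in the previous proposition is always non-trivial, as $\underline{\omega} > 0$, but there is very little hope that it is significantly bounded away from zero for all choices of the parameter $b$ in the whole $pd$-dimensional space. To that end, note that the risk sets $\mathcal{R}_j$ are naturally nested as follows: $\mathcal{R}_1 \supseteq \cdots \supseteq \mathcal{R}_N$, with $\cup_{j=1}^N \mathcal{R}_j = \mathcal{R}_1$ and $\cap_{j=1}^N \mathcal{R}_j = \mathcal{R}_N$. Then, for $i \in \mathcal{R}_1 \setminus \mathcal{R}_2$,

$\omega_i(b) = \frac{\exp\{b^T \Psi(X_i)\}}{\sum_{l \in \mathcal{R}_1} \exp\{b^T \Psi(X_l)\}} \ge \frac{\exp\{b^T \Psi(X_i)\}}{|\mathcal{R}_1| \max_{l \in \mathcal{R}_1} \exp\{b^T \Psi(X_l)\}}$.

Following similar reasoning we have, for $i \in \mathcal{R}_2 \setminus \mathcal{R}_3$, that is $i \in \mathcal{R}_1$, $i \in \mathcal{R}_2$, $i \notin \mathcal{R}_3$,

$\omega_i(b) \ge \frac{2 \exp\{b^T \Psi(X_i)\}}{|\mathcal{R}_1| \max_{l \in \mathcal{R}_1} \exp\{b^T \Psi(X_l)\}}$,

where we used $\max_{l \in \mathcal{R}_2} \exp\{b^T \Psi(X_l)\} \le \max_{l \in \mathcal{R}_1} \exp\{b^T \Psi(X_l)\}$ and $|\mathcal{R}_1| \ge |\mathcal{R}_2|$. By applying similar reasoning we can see that for $i \in \mathcal{R}_N$

$\omega_i(b) \ge \frac{N \exp\{b^T \Psi(X_i)\}}{|\mathcal{R}_1| \max_{l \in \mathcal{R}_1} \exp\{b^T \Psi(X_l)\}}$.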
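Knowing that $\min_{i \in I} \omega_i(b) = \min\{\min_{i \in \mathcal{R}_1 \setminus \mathcal{R}_2} \omega_i(b), \min_{i \in \mathcal{R}_2 \setminus \mathcal{R}_3} \omega_i(b), \dots, \min_{i \in \mathcal{R}_N} \omega_i(b)\}$, we are left to analyze the relative ordering of the minima on the right hand side of the previous equality. It is obvious that the minima on the right hand side are non-decreasing, hence concluding that

$\min_{i \in I} \omega_i(b) = \min_{i \in \mathcal{R}_1 \setminus \mathcal{R}_2} \omega_i(b) \ge \frac{\min_{i \in \mathcal{R}_1} \exp\{b^T \Psi(X_i)\}}{|\mathcal{R}_1| \max_{l \in \mathcal{R}_1} \exp\{b^T \Psi(X_l)\}}$.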
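
The final lower bound on $\min_{i \in I} \omega_i(b)$ can be examined numerically. The sketch below is an illustration under the same assumptions as before (no tied failure times; hypothetical function names). It shows how the bound deteriorates as the range of $\exp\{b^T \Psi(X_i)\}$ grows, which is the mechanism behind the sparsity requirement discussed next.

```python
import numpy as np

def omega_weights(b, Psi, Z, delta):
    """Weights omega_i(b) from (15), one per subject."""
    w = np.exp(Psi @ b)
    events = np.flatnonzero(delta)
    risk = Z[None, :] >= Z[events, None]
    probs = risk * w / (risk * w).sum(axis=1, keepdims=True)
    return probs.sum(axis=0)

def omega_lower_bound(b, Psi, Z, delta):
    """min_{i in R_1} exp{b^T Psi(X_i)} / (|R_1| * max_{l in R_1} exp{b^T Psi(X_l)})."""
    w = np.exp(Psi @ b)
    in_R1 = Z >= Z[delta == 1].min()               # first risk set, which equals I
    return w[in_R1].min() / (in_R1.sum() * w[in_R1].max())

rng = np.random.default_rng(6)
Psi = rng.uniform(-1.0, 1.0, size=(50, 3))
Z = rng.exponential(size=50)
delta = (rng.uniform(size=50) < 0.7).astype(int)
in_I = Z >= Z[delta == 1].min()

for scale in (0.1, 1.0, 5.0):                      # larger scale widens the range of exp{b^T Psi}
    b = scale * np.array([1.0, -1.0, 0.5])
    omega_min = omega_weights(b, Psi, Z, delta)[in_I].min()
    print(scale, omega_min >= omega_lower_bound(b, Psi, Z, delta))
```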

It is common to assume that all $n$ observations belong to the set $\mathcal{R}_1$, making its cardinality equal to $n$. From the last inequality it becomes clear that the smaller the range of the random variables $\exp\{b^T \Psi(X_i)\}$, the further $\underline{\omega}$ is bounded away from zero. Guaranteeing that would amount to requiring $\sup_{b \in \mathbb{R}^{pd}} \|f_b\|_\infty \le c \log n$ for some positive constant $c$. Obviously, such a requirement in high dimensional spaces restricts the set of functions $f$ significantly. On the other hand, requiring that the sparse approximation has the property $\|f_{\beta^*}\|_\infty \le c \log n$ requires $s \le \log n$, as the constant $c$ is bounded (all $\omega_i(\beta^*)$ are bounded random variables). Note that the usual upper bound on the size of the sparsity set is $n$, but due to the complex problem structure we have to pay the price of requiring much smaller sparsity. A similar conclusion was reached in the recent paper Bradic et al. (2011), where they discovered that the situation of $s = 25$, $p = 1000$ and $n = 100$ is not sparse enough and even the oracle estimator fails empirically, compared to $s = 4$, $p = 1000$, $n = 100$, where the oracle estimator was behaving regularly. In that sense, Corollary 1 below gives a complete picture of how the sparsity size constrains the problem at hand and shows that the interplay between true and effective dimensionality in survival models is different from the case of linear regression.

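Corollary 1. Let $N$ represent the number of distinct events. Then the following uniform sandwich bound holds for the norm $\|f_b\|_{n,\beta^*}$:

(17) $\underline{\omega} \|f_b\|_n^2 \le \|f_b\|^2_{n,\beta^*} \le N \|f_b\|_n^2$,

where $\underline{\omega} = \min\{\omega_i(\beta^*) : i \in \{1, \dots, n\}, i \in \cup_{q=1}^N \mathcal{R}_q\}$.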

Propositions 1 and 2 show a non-asymptotic parallel to Le Cam's theory of local asymptotic normality (LAN) (Le Cam, 1960), which shows that a statistical model can be locally approximated by a Gaussian model. Le Cam's approximation is asymptotic in nature. Our extension is non-asymptotic, as it bounds our model from above and from below with geometrically Gaussian-like models; that is, we identified lower and upper processes $\mathcal{G}_{n,\eta_1}(b)$, $\mathcal{G}_{n,\eta_2}(b)$ such that there exists a ball $\mathbb{B}_{\beta^*}(r_n)$ such that

$P\big(\mathcal{G}_{n,\eta_1}(b) \le \mathcal{R}_n(b) \le \mathcal{G}_{n,\eta_2}(b), \forall b \in \mathbb{B}_{\beta^*}(r_n)\big) = 1$.

Independent work that discusses a non-asymptotic extension of Le Cam's theory appeared in Spokoiny
(2012). Our work for the Cox model can be viewed as a non-trivial extension of their work on general
log-likelihood structures, for the following reasons. Their bounding processes are defined for
$p \ll n$ and by shrinking and expanding the equivalent of the Hessian matrix $\nabla^2\mathcal{L}_n$. This approach
is suitable for cases where such a matrix does not depend on the parameter space. In other cases, such
as the situation in our setup, it is not obvious how to define lower and upper bounding processes
with this approach. Furthermore, by adapting the steps of condition (ED$_1$) and Theorem 3.1 of their
approach to the Cox model, the bound achieved is not as tight as the approximation (16).^



^ Parameter $\omega(r)$ in condition (ED$_1$) of Spokoiny (2012) would have to be chosen strictly positive, due to
dependence of the score vector on the parameter space, thus requiring the existence of a strictly positive constant $\rho$ and
addition of non-negative definite matrix Vq in defining the norm ||/b||n,/3* = 5I]"=i /J'C^ ~ /9*){Vn(/3*) -I- 
pyo}(b-/3*). 
By taking an inherently different approach we were able to identify two types of lower and upper
bounding processes $\mathcal{G}_{n,\eta_1}(b)$, $\mathcal{G}_{n,\eta_2}(b)$, defined in (13), where in one bound $\eta_1 = \exp\{-2a_{b,\beta^*}\}$
and in the other $\eta_1 = \underline{\omega}$. Different from the classical LAN approach of Le Cam, from Propositions 1
and 2 we can see that both $\eta_1$'s principally depend on $p$, and they become zero as $p$ increases.
For a local neighborhood $B(r_n)$ of size $r_n > 0$, defined as $\{b : \sum_{j=1}^{p} d^{1/\gamma_j^*}\|b_j - \beta_j^*\|_{\gamma_j} \le r_n/d\}$, by
Proposition 1 we have $P\left(\mathcal{G}_{n,e^{-2Cr_n}}(b) \le \mathcal{R}_n(b) \le \mathcal{G}_{n,e^{2Cr_n}}(b),\ \forall b \in B(r_n)\right) = 1$, or in other terms

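$\left\|I_{pd} - \{\nabla^2\mathcal{L}_n(\beta^*)\}^{-1/2}\{\nabla^2\mathcal{L}_n(b)\}\{\nabla^2\mathcal{L}_n(\beta^*)\}^{-1/2}\right\|_\infty \le e^{2Cr_n}, \quad \forall b \in B(r_n).$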

The above inequality is a log-likelihood alternative to the irrepresentable condition (Meinshausen and Bühlmann,
2006) defined for $\ell_2$ loss functions, where $\nabla^2\mathcal{L}_n(b) = \nabla^2\mathcal{L}_n(\beta^*)$ for every $b \in \mathbb{R}^{pd}$. This inequality
also shows that geometrically quadratic bounds to the Cox model are not independent of the size
of the local neighborhood $r_n$. In the following, we discuss the relationship between these bounds
and the choice of $r_n$.
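
The dependence of the Hessian on $b$ can be seen numerically. The following is a minimal Python sketch, not the procedure of this paper: it assumes a plain linear Cox model without tied event times (rather than the nonparametric basis expansion used here), and the function name `cox_neg_hessian` and the simulated data are hypothetical. It evaluates the negative Hessian of the log partial likelihood at two parameter values; for least squares the corresponding matrix $X^TX/n$ would be identical at both points.

```python
import numpy as np

def cox_neg_hessian(X, time, event, beta):
    """Negative Hessian of the Cox log partial likelihood, scaled by 1/n.

    Assumes no tied event times; X is (n, p), event is a 0/1 indicator.
    """
    n, p = X.shape
    w = np.exp(X @ beta)                      # relative risks exp(x_i' beta)
    H = np.zeros((p, p))
    for i in np.flatnonzero(event):
        risk = time >= time[i]                # risk set at the i-th event time
        wr, Xr = w[risk], X[risk]
        s0 = wr.sum()
        mean = (wr[:, None] * Xr).sum(axis=0) / s0
        second = (Xr * wr[:, None]).T @ Xr / s0
        H += second - np.outer(mean, mean)    # weighted covariance over the risk set
    return H / n

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
time = rng.exponential(size=n)
event = rng.random(n) < 0.7                   # illustrative censoring pattern

H_at_zero = cox_neg_hessian(X, time, event, np.zeros(p))
H_at_b = cox_neg_hessian(X, time, event, np.full(p, 1.5))
hessian_gap = np.abs(H_at_zero - H_at_b).max()
print("max entrywise change in the Cox Hessian:", hessian_gap)
```

The gap printed at the end is what forces the quadratic bounds above to depend on the size of the neighborhood: the further $b$ moves from $\beta^*$, the more the risk-set weights change.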

It is clear that, when $r_n$ does not converge to zero, the usual upper bound on the right hand side
of the last inequality (as assumed for linear models with $\ell_2$ criteria) cannot hold for any $b$ in
the Cox model. If we are in a small local neighborhood where $r_n \to 0$, then there is hope that such an upper
bound might hold. In Spokoiny (2012), a similar conclusion was made for the case of generalized linear
models (see Theorem 5.7 of the cited paper). There $\eta_1 = \sqrt{1 - \delta}$, $\eta_2 = \sqrt{1 + \delta}$, $\delta \ge \delta(r_n) > 0$ such
that

$\left\|I_{pd} - \{\nabla^2\mathcal{L}_n(\beta^*)\}^{-1/2}\{\nabla^2\mathcal{L}_n(b)\}\{\nabla^2\mathcal{L}_n(\beta^*)\}^{-1/2}\right\|_\infty \le \delta(r_n),$
for all $b \in B(r_n)$. If $r_n$ is fixed and bounded away from zero, then $\eta_1$ and $\eta_2$ will be too small and
too large respectively. A sandwich bound hence requires a uniformly small bound on the value of $\delta(r_n)$, which on the
other hand grows with $r_n$. This shows that the previous inequality, needed for sandwich bounds, cannot
hold for all values of $b$.

The situation is better for correlated linear models (van de Geer, 2011), where $\hat\Sigma$ and $\Sigma$, the sample and

population covariance matrices respectively, indeed satisfy

$\|\hat\Sigma - \Sigma\|_\infty \le \sqrt{\log p/n},$

and finite sample oracle prediction properties of simple linear models carry over to correlated linear
models. The previous discussion shows that this may not be possible in the Cox model and that the
ideas of local non-asymptotic normality, as in Spokoiny (2012), cannot trivially be extended to high
dimensional spaces.
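
The $\sqrt{\log p/n}$ scale of the sup-norm deviation in the linear case can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the text: rows are drawn i.i.d. from a standard Gaussian (so the population covariance is the identity), and the helper name `sup_norm_cov_gap` is hypothetical. The bound in the display holds up to a constant, so the code reports the ratio rather than testing the inequality itself.

```python
import numpy as np

def sup_norm_cov_gap(n, p, rng):
    """Largest entrywise gap between sample and population covariance."""
    X = rng.normal(size=(n, p))          # rows i.i.d. N(0, I_p), so Sigma = I_p
    sigma_hat = X.T @ X / n              # sample covariance (mean known to be zero)
    return np.abs(sigma_hat - np.eye(p)).max()

rng = np.random.default_rng(1)
n, p = 500, 200
gap = sup_norm_cov_gap(n, p, rng)
rate = np.sqrt(np.log(p) / n)
print(f"sup-norm gap = {gap:.3f}, sqrt(log p / n) = {rate:.3f}, ratio = {gap / rate:.2f}")
```

In this design the gap does not depend on any regression parameter, which is exactly the feature that is lost in the Cox model.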

To that end, to extend the idea of local non-asymptotic normality to high dimensional problems,
we develop a new two-stage technique. First, with the help of sparsity and regularization, we
localize the problem to a small, dimensionality independent local neighborhood, where we show that
our penalized estimator $\hat\beta$ belongs to $B(r_n)$ with exponentially high probability (see Theorem 3 in
Section 4 for more details, where we prove that r„ is proportional to s^A^^/C^). Secondly we localize 
the complexity of the problem further by shrinking $\nabla^2\mathcal{L}_n(b)$ in that local parameter space to its
sparse alternative $\nabla^2\mathcal{L}_n(\beta^*)$. Then, we show how to inherit the finite sample oracle properties of
penalized $\ell_2$ criteria and behave "normally" (see Theorem 4 in Section 4, where we show that
the proposed estimator inherits the prediction error of the penalized least squares estimator).

3. General Oracle Inequality 

In this section we show how general oracle inequalities of quadratic type (Rigollet, 2012; Lecue and Mandelson,
2012) (from now on denoted by GOI) cannot be achieved for Cox proportional hazards models (2),
even for low dimensional problems with a convex penalty function. We show that under no conditions
on the correlation structure in the covariates, oracle inequalities only seemingly reach the prediction
properties of an oracle estimator with the expected slow rate of $\log p/n$ (see Theorem 1). They
become only local properties since, to have non-trivial bounds, they require $\hat\beta$ to be in a local
neighborhood of the sparse alternative $\beta^*$ (see Corollary 2). Interestingly, we show that they cannot
be extended to hold outside of such a local neighborhood. It becomes apparent that, in order to obtain
dimensionality free finite sample prediction results, we first need to show some localization property
of the estimator (6) (which we leave for Section 4).

Note that from (11) we have that the following representation holds uniformly over $b$:

$\mathcal{R}_n(\hat\beta) - \mathcal{R}_n(b) = \frac{1}{2}\|f_{\hat\beta} - f_{\beta^*}\|^2_{n,b_{\hat\beta}} - \frac{1}{2}\|f_b - f_{\beta^*}\|^2_{n,b^*} + (b - \hat\beta)^T h_n(\beta^*)$

for $b_{\hat\beta} = c\hat\beta + (1 - c)\beta^*$ and $b^* = \tilde{c}b + (1 - \tilde{c})\beta^*$, for a particular choice of $c \in (0,1)$ and
$\tilde{c} = \tilde{c}(b) \in (0,1)$. Note that this quadratic representation inherently depends on the structure of
the log likelihood $\mathcal{L}_n$, but as long as one can define an empirical norm such that $-b^T\{\nabla^2\mathcal{L}_n(b^*)\}b =
\|f_b\|^2_{n,b^*}$, it will hold for any log or partial log-likelihood. In that sense the result of Theorem 1
presented in (20) will hold for any log or partial log-likelihood structure. Note that the latter do
not have to have an i.i.d. structure, and in that sense the GOI results presented here are much more
general and hold over a wider class of problems. The probability bound in (20) will differ from one
example to the other. From the definition of the penalized estimator as the minimizer of the penalized
empirical risk in (6) we have $\mathcal{R}_n(\hat\beta) + \lambda_n P(\hat\beta) \le \mathcal{R}_n(b) + \lambda_n P(b)$. Combining the previous two displays we get

(18) $\|f_{\hat\beta} - f_{\beta^*}\|^2_{n,b_{\hat\beta}} \le \|f_b - f_{\beta^*}\|^2_{n,b^*} + 2(\hat\beta - b)^T h_n(\beta^*) + 2\lambda_n(P(b) - P(\hat\beta)),$

for any $b$, with $b^*$ and $b_{\hat\beta}$ fixed and defined as before.

The following condition is needed for controlling the size of the score vector $h_n(\beta^*)$, which is
necessary for obtaining risk properties of the estimator $\hat\beta$.

Condition 1. The following conditions are satisfied:

(i) There exists a continuous function $s^{(0)}$ defined on $\beta^* \times [0, \tau]$ such that

$\sup_{0 \le t \le \tau} |S_n^{(0)}(\beta^*, t) - s^{(0)}(\beta^*, t)| \le a, \qquad \inf_{0 \le t \le \tau} |s^{(0)}(\beta^*, t)| \ge b > 0,$

almost surely, for some positive constants $a, b > 0$ such that $b > a$.

(ii) There exists a multivariate continuous function $s^{(1)}$, the population version of $S_n^{(1)}$, defined on
$\beta^* \times [0, \tau]$ such that for each $1 \le j \le p$, $1 \le k \le d$, $\sup_{t \in [0,\tau]} |\{s^{(1)}\}_{jk}(\beta^*, t)| \le b_1$, for some positive
constant $b_1 > 0$.

(iii) The process $Y(t) = (Y_1(t), \dots, Y_n(t))^T$ is left continuous with right hand limits and such that
$P\{Y_i(t) = 1,\ 0 \le t \le \tau\} > 0$ for $i = 1, \dots, n$.

This set of conditions replaces the classical conditions used in the asymptotic analysis of estimation
properties of the Cox model, such as those presented in Condition 2 of Bradic et al. (2011).
They are in particular relaxed to handle high dimensional settings, since they are required only
at $\beta^*$ and not necessarily uniformly over the parameter space, in contrast to the classical assumptions of
Fleming and Harrington (2005).

The following Lemma 1 guarantees that the convexity of the penalty function $P(b)$ is bounded by
the growth of the linear part of the stochastic approximation of the risk function $\mathcal{R}_n(b)$. Furthermore,
it enables us to conclude that the penalized estimator $\hat\beta$ will be sandwiched between the arg min of two
penalized Gaussian-type processes of the form $\mathcal{G}_{n,\eta} + \lambda_n P$ with the function $P$ as in (7).
Lemma 1. On the event $\mathcal{E}_n = \{\|h_{nj}(\beta^*)\|_{\gamma_j^*} \le \lambda_n d^{1/\gamma_j^*}\rho'(0+),\ \forall j \in \{1, \dots, p\}\}$ we have, for convex
penalty functions defined in (7), that the following holds:

(19) $\mathcal{G}_{n,0}(0) = \min_{b \in \mathbb{R}^{pd}}\{\mathcal{G}_{n,0}(b) + \lambda_n P(b)\}.$

Note that Proposition 1 does not require any type of irrepresentable condition because it is a
result on the lower linear (and not quadratic) approximation of the penalized process $\mathcal{R}_n(b) +
\lambda_n P(b)$. Therefore, we are able to obtain a general oracle inequality (GOI from here on) without any
assumptions on the correlation structure in the data. Similar properties have been established for
$\ell_1$ type penalized criteria (Gaiffas and Guilloux, 2012; Lecue and Mandelson, 2012; Rigollet,
2012), where the empirical $\|f_b\|_n$ norm was essential. In the next section we will provide a GOI for the
nonparametric Cox model (1). The following theorem is the main result of this section. It is a
general oracle inequality showing that the penalized estimator $\hat\beta$ reaches the risk properties of an
oracle estimator measured through the functional norm (9) introduced in Section 1.

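Theorem 1 (GOI). Let $\hat\beta$ be defined as in (6) with the convex penalty function $P(b)$ defined in (7).
Then, on the event $\mathcal{E}_n = \bigcap_{j=1}^{p}\mathcal{E}_{n,j}$, $\mathcal{E}_{n,j} = \{\|h_{nj}(\beta^*)\|_{\gamma_j^*} \le \lambda_n d^{1/\gamma_j^*}\rho'(0+)\}$, there exists $c \in (0, 1)$
and $b_{\hat\beta} = c\hat\beta + (1 - c)\beta^*$ such that

(20) $\|f_{\hat\beta} - f_{\beta^*}\|^2_{n,b_{\hat\beta}} \le \min_{b \in \mathbb{R}^{pd}}\{\|f_b - f_{\beta^*}\|^2_{n,b^*} + 4\lambda_n P(b)\}$

for $b^* = \tilde{c}b + (1 - \tilde{c})\beta^*$ and $\tilde{c} = \tilde{c}(b) \in (0, 1)$. Moreover, there exists a positive constant $c_1 > 0$
such that the event satisfies

(21) $P(\mathcal{E}_n) \ge 1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\max_j\gamma_j^*}\rho'^2(0+)\}.$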

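Proof of Theorem 1. Utilizing Lemma 1 we have that on the event $\mathcal{E}_n = \{\|h_n(\beta^*)\|_\infty \le \lambda_n\rho'(0+)\}$,
for $\Delta = \hat\beta - b \in \mathbb{R}^{pd}$, the following holds:

$(\beta^*)^T h_n(\beta^*) - \mathcal{L}_n(\beta^*) = \mathcal{G}_{n,0}(0) \le \mathcal{G}_{n,0}(\Delta) + \lambda_n P(\Delta) =$

$-(\Delta - \beta^*)^T h_n(\beta^*) - \mathcal{L}_n(\beta^*) + \lambda_n P(\Delta),$

which is equivalent to $\Delta^T h_n(\beta^*) \le \lambda_n P(\Delta)$. Together with (18) it leads to

$\|f_{\hat\beta} - f_{\beta^*}\|^2_{n,b_{\hat\beta}} \le \|f_b - f_{\beta^*}\|^2_{n,b^*} + 2\lambda_n P(\hat\beta - b) + 2\lambda_n(P(b) - P(\hat\beta))$
$\le \|f_b - f_{\beta^*}\|^2_{n,b^*} + 4\lambda_n P(b),$

for some $b_{\hat\beta} = c\hat\beta + (1 - c)\beta^*$, $b^* = \tilde{c}b + (1 - \tilde{c})\beta^*$ and $c, \tilde{c} = \tilde{c}(b) \in (0, 1)$. The last inequality
holds due to the increasing, convex and symmetry properties of the penalty function $P$ (which lead
to subadditivity, so that $P(\hat\beta - b) \le P(\hat\beta) + P(b)$). With the last inequality the statement of the general oracle inequality (20) is
proved.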

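We are left to compute the probability of the event $\mathcal{E}_n = \{\|h_{nj}(\beta^*)\|_{\gamma_j^*} \le \lambda_n d^{1/\gamma_j^*}\rho'(0+),\
\forall j \in \{1, \dots, p\}\}$. To that end note that

(22) $\mathcal{E}_n^c \subseteq \bigcup_{j=1}^{p}\{\|h_{nj}(\beta^*)\|_\infty > \lambda_n d^{1/\gamma_j^*}\rho'(0+)\},$
where $\|h_{nj}(\beta^*)\|_\infty = \max_{1 \le k \le d}|\{h_n\}_{jk}(\beta^*)|$. First note that we can decompose $\{h_n\}_{jk}(\beta^*) =
u_{jk} + v_{jk}$ as follows:

(23) $n^{-1}\sum_{i=1}^{n}\int_0^\tau\left(\{E_n(\beta^*)\}_{jk} - \{e(\beta^*)\}_{jk}\right)dM_i + n^{-1}\sum_{i=1}^{n}\int_0^\tau\left(\{e(\beta^*)\}_{jk} - \Psi_k(X_{ij})\right)dM_i.$
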

With this in mind we have that 
P ( 

(24) ^ U I^^fc I ^ ^"^'^^^ U I'^^fc I ^ ^"^'^""^ 

^-^ l<A;<a l<A;<a 

In order to get rates of convergence, we will extend the idea of Theorem 3.1 in Bradic et al. (2011) to
Hoeffding-type inequalities for maxima of averages. This will be possible due to the particular
structure of the martingale $\{h_n\}_{jk}(\beta^*)$. To that end first note that, by the boundedness property
of the functions $\Psi_k$, we know that $\Psi_k(X_{ij})$ are bounded random variables. Hence, each $v_{jk}$ is a sum of
a sequence of i.i.d. sub-Gaussian random variables. However, across $k$'s, i.e. group elements, the $v_{jk}$ are
not independent random variables. Hence, we apply the extension of Hoeffding's inequality (see for
example Lemma 14.15 in Bühlmann and van de Geer (2011)),



P [ maxji^jfel > ||Af||„J2 (^t^ + ^^^^ j < 2exp{-nt2}, 



where $\|M\|_n$ is proportional to $\sqrt{\sum_{i=1}^{n} M_i^2(t)}$. In the last statement we have utilized the uniform

boundedness of the B-spline basis. Since $M$ is a bounded martingale, we can conclude that there exists
another constant $c > 0$ such that $\|M\|_n \le c$. Hence, we have



(25) P ( max > Xnd^^'^h' {0+)] < 4rfcxp ■ 

\l<k<d J 



4c2 



Next, we need to develop a Hoeffding-type inequality for martingale sequences in order to
handle the $u_{jk}$'s. First we start by bounding the jumps and the predictable variation of the martingale
$u_{jk}$. Note that the jumps are bounded by

$|\Delta u_{jk}| = \frac{1}{n}\left|\{E_n(\beta^*)\}_{jk} - \{e(\beta^*)\}_{jk}\right| \le \frac{1}{n}\sup_{0 \le t \le \tau}\left\|E_n(\beta^*, t) - e(\beta^*, t)\right\|_\infty.$

For the predictable variation process we have 

{Av,k)2 = ^ r m,{(3*,t)},,-{ei(3\t)},kfd{Mit)) 
Jo 

< - sup ||{E„(r,t)}-{e(/3*,i)}||L r S^y{(3*,t)dAo{t) 
n o<t<T Jo 

< ^ sup ii{E„(r,i)}-{e(r,o}iiL: 

n o<t<T 

where by utilizing Condition 1 there exists a constant ci > such that sli\f3* ,t)dAo{t) < 
s(°)(/3*,i)dAo(i) + supo<t<^ S^n\f3*,t) - s^°^{/3*,t) Ao(t) < ci. The following Lemma is crucial 
in establishing Gaussian quadratic rates in Bernstein's inequality. Its proof is relegated to the 
Appendix. 
Lemma 2. If Condition 1 is satisfied, the following bound holds almost surely:
(26) sup ||E„(/3*,t)-e(/3*,t)||oo<2a?, 

0<t<T 

for ai=bi_+aC + sup^ \s^"\(3* ,t)\, C = maxij, fe |«'fe(Xij)|. 

From Lemma 2 we have that there exist constants ci, K and C2 = 2af and Ki such that | Auj-^l < 
C2/n = X and {^Vjk)t < ciC2/?t. = ii'i- To that end we can now apply martingale deviation 
inequality (van de Geer. 1995) to Vj/. to obtain 

P{\vjk\ > Un) < 2cM-riul/{Kun + Kf)}. 

For the choice of w„ ~ Xny/nd^^'^i /9'(0+) note that Kun ~ XnC2p'{0+)d^^'^j /y/n < A„C2(i'^^^J /y/n < 
C2d^/^i I \fn. Note that K\ = cic|/n. Since c\ is an upper bound on Sn\f3* ,t)dAo{t) we can 
always choose it to be larger than ^/nd (which for a typical choice of d becomes a constant) 

and as such it will satisfy that = cic^/n > C2d^^''i j ^/n > Kun- With the last statement, using 
union bound we have 

P ^ rnax^ \vjk\ > < 2d cxp{~nu'^/2Kf} . 

Together with (25) we conclude that there exist another positive constant ci ~ max{^, such 
that 

< Gctmax < exp < -n — > , exp < -n 

< 6dexp{-cinA2d2/7*p'2(o+)}, 

for each $1 \le j \le p$. This implies by a simple union bound that

$P(\mathcal{E}_n) \ge 1 - 6d\sum_{j=1}^{p}\exp\{-c_1 n\lambda_n^2 d^{2/\gamma_j^*}\rho'^2(0+)\} \ge 1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\max_j\gamma_j^*}\rho'^2(0+)\}.$



□ 
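
To see how the probability bound (21) drives the choice of the tuning parameter, the following Python sketch inverts it for $\lambda_n$. It is a worked illustration under assumptions: the bound is taken in the form $1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\gamma^*}\rho'^2(0+)\}$ with a single conjugate exponent $\gamma^*$, the constant `c1` is unknown in practice and is set to a placeholder value, and the function names are hypothetical.

```python
import math

def goi_event_probability(n, p, d, lam, c1, gamma_star, rho_prime0=1.0):
    """Lower bound 1 - 6 p d exp{-c1 n lam^2 d^(2/gamma*) rho'(0+)^2}."""
    exponent = c1 * n * lam**2 * d ** (2.0 / gamma_star) * rho_prime0**2
    return 1.0 - 6.0 * p * d * math.exp(-exponent)

def lambda_for_level(n, p, d, alpha, c1, gamma_star, rho_prime0=1.0):
    """Smallest lam for which the bound above equals 1 - alpha."""
    scale = c1 * n * d ** (2.0 / gamma_star) * rho_prime0**2
    return math.sqrt(math.log(6.0 * p * d / alpha) / scale)

# Placeholder values: c1 = 1.0 and gamma* = 2 are assumptions, not values from the text.
n, p, d = 200, 1000, 5
lam = lambda_for_level(n, p, d, alpha=0.05, c1=1.0, gamma_star=2.0)
prob = goi_event_probability(n, p, d, lam, c1=1.0, gamma_star=2.0)
print(f"lambda_n = {lam:.4f}, probability bound = {prob:.4f}")
```

Under these assumptions the resulting $\lambda_n$ scales like $\sqrt{\log(pd)/n}$, the same order as in penalized least squares.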



The risk properties of the estimator $\hat\beta$ are stated in terms of the empirical functional norms $\|\cdot\|_{n,b}$.
Note that the parameter $b^*$ on the right hand side of (20) is truly different for every $b$, as each constant
$\tilde{c} = \tilde{c}(b)$ is different for each $b$. An empirical result of a similar kind appeared independently in
Theorem 6.1 of van de Geer (2011) on the structured group lasso for linear regression. Because of the
complex log likelihood structure we have to sacrifice simplicity and state the results in terms of two
different empirical norms. They become equal to each other in the special case of linear models,
in which case they become equivalent to the empirical norm $\|\cdot\|_n$. Lemler (2012) considered GOI
results for the Cox model with known baseline hazard in terms of a newly defined empirical Kullback-Leibler
divergence (see Theorem 2 of the aforementioned work) for a Lasso penalized conditional hazards
model. It is not clear how that result relates to the $\ell_2$ empirical norm (10) and how it is affected by
the dimensionality $p$. In contrast to their work, we establish that the relationship between the two norms
might be significantly impaired by the dimensionality $p$ and that one has to be careful when stating
results in terms of empirical norms, as they might become trivial and nonsensical.
In more detail, note that the obtained GOI result of Theorem 1 is not an exact GOI in the sense
that the left hand side and right hand side have leading coefficient equal to one (Lecue and Mandelson,
2012). Although it might look like they do, due to the structure of the empirical norms $\|\cdot\|_{n,b_{\hat\beta}}$ and
$\|\cdot\|_{n,b^*}$ and Propositions 2 and 1 this is clearly not the case (see Corollary 2). Due to the complexity
of the log partial likelihood $\mathcal{L}_n(b)$ it is impossible to obtain an exact GOI in this case. There is very
little hope that a GOI similar to that of linear models can be achieved here. More generally, we conjecture
that this holds for many complex log-likelihoods whose Hessian matrix depends on the parameter.
To that end, notice that from the discussion on Local Non-Asymptotic Quadraticity of Section
2.1 and Local Non-Asymptotic Normality of Section 2.2 we can conclude the following localized
version of the previous Theorem.

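Corollary 2 (Local GOI). Using the notation and conditions of Theorem 1 the following holds
on the event $\mathcal{E}_n \cap \{\underline{\omega} > 0\}$, for $\epsilon = (N - \underline{\omega})/\underline{\omega}$,



with $B(r_n) = \{b \in \mathbb{R}^{pd} : \sum_{j=1}^{p} d^{1/\gamma_j^*}\|b_j - \beta_j^*\|_{\gamma_j} \le r_n/d\}$, and $\mathcal{E}_n$ defined in Theorem 1.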

The result of Corollary 2 resembles the mentioned GOI results for linear regression models, but requires
that $\underline{\omega}$ is bounded away from zero, a condition that cannot be guaranteed to hold in high dimensions.
For example, the larger $p$ gets, the smaller $\underline{\omega}$ gets. With sufficiently small $\underline{\omega}$ (for example of the form
$\underline{\omega} = \delta$, $\delta \to 0+$), $\epsilon$ explodes. Moreover, this condition is connected to assuming strong convexity of
the partial likelihood in the whole $d$ dimensional space. In recent work Negahban et al. (2012)
discuss a similar problem of global and local strong convexity in generalized linear models, where
they reach the same conclusion that such conditions are far less restrictive in a local neighborhood
$B(r_n)$. They also address the importance of identifying these local neighborhoods in a specific model
(see Section 4 on the shape and size of these neighborhoods for the non-parametric Cox model (2)). In
that sense (27) is truly a local GOI result that depends on the size of the local neighborhood $B(r_n)$.
Lemler (2012) does not discuss similar properties of their estimator. We conjecture that it suffers
from the same locality issue.
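
The blow-up of $\epsilon$ is elementary but worth making concrete. The short Python sketch below evaluates $\epsilon = (N - \omega)/\omega$ for a fixed number of events and a shrinking lower bound $\omega$; the values of `N` and `omega` are hypothetical and chosen only for illustration, and the function name is not from the text.

```python
def local_goi_factor(n_events, omega_lower):
    """epsilon = (N - omega) / omega; only defined for omega > 0."""
    if omega_lower <= 0:
        raise ValueError("omega must be strictly positive")
    return (n_events - omega_lower) / omega_lower

N = 80  # hypothetical number of distinct events
for omega in (40.0, 4.0, 0.4, 0.04):
    print(f"omega = {omega:5.2f} -> epsilon = {local_goi_factor(N, omega):8.1f}")
```

Each tenfold decrease in $\omega$ inflates $\epsilon$ roughly tenfold, so any bound carrying a factor $1 + \epsilon$ becomes uninformative unless $\omega$ is controlled.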

In this sense GOI results of the least squares type, like the one in (27), become localized GOI results
for complex models. We believe they cannot trivially be strengthened to hold in full generality
without restricting the dimensionality of the problem, except for a few special cases where the empirical
norm (9) is independent of $b^*$. To that end, to be able to infer oracle results of Gaussian type, we
first need to connect the introduced norm (9) to the norm $\|\cdot\|_{\beta^*}$ (see Proposition 1).
This leads to a two-stage argument. In the first stage, the local property of the penalized estimator
$\hat\beta$ gives the desired dimensionality free connection between the norms $\|\cdot\|_n$ and $\|\cdot\|_{\beta^*}$. In the
second stage, we localize the penalized estimator further and infer finite sample $\|\cdot\|_n$ oracle
inequalities that hold for $p \gg n$.



4. Oracle Inequalities of Gaussian Type

In this section we expose the details of the mentioned two-stage argument, and show how $\hat\beta$
inherits finite sample predictive properties similar to those of the penalized least squares estimator. Localization
of the problem in the first stage is done with the help of a sparse oracle inequality (Theorem 2), which
provides a tail bound on the estimator (Theorem 3). Further localization in the second stage is done
with the help of the newly developed technique of local non-asymptotic normality, where we relate the
complexity of the local Cox model to the local linear Gaussian regression model (Theorem 4). As
a by-product we show how the censoring rate $N/n$ interplays with the dimensionality $p$ and affects its
predictive power.

Define the following set of restrictions

(28) $\mathcal{C}_{\vartheta,\rho} = \{b \in \mathbb{R}^{pd} : \|\rho(b_{\mathcal{M}_*^c})\|_1 \le \vartheta\|\rho(b_{\mathcal{M}_*})\|_1\},$

where $\|\rho(b)\|_1 = P(b)$ is the penalty function defined in (7) and $\mathcal{M}_* = \mathcal{M}_*(\beta^*) = \{j \in \{1, \dots, p\} :
\|\beta_j^*\|_{\gamma_j} \neq 0\}$, $\mathrm{card}\{\mathcal{M}_*\} \le s$. The set $\mathcal{C}_{\vartheta,\rho}$ consists of all vectors that have support similar to
the sparse vector $\beta^*$, and it changes its shape depending on the penalty function $\rho$ and the choice of
$\gamma_j$'s. It represents a generalization of the cone constraint condition that appears in work on Lasso
problems, with the $\ell_1$ norm being replaced by the FGP function to represent the complex group structure
present in the model. To prove sparse oracle inequalities we will use a condition in the spirit of
Bickel et al. (2009); Meier et al. (2009); Lounici et al. (2011). This condition arises naturally when
studying finite sample risk properties of the Cox model. Depending on the type of penalty function, it
represents a uniform adaptation of the Restricted Eigenvalue Condition (Bickel et al., 2009). We refer to
van de Geer and Bühlmann (2009) for a comparison of different kinds of compatibility and restricted
eigenvalue conditions and their relationships for linear models. The irrepresentable condition, as defined
in Meinshausen and Bühlmann (2006) for linear models, takes a more complicated structure for Cox
type models (see Bradic et al. (2011) for more details), so we refrain from adopting it for
non-parametric hazard models.
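
Membership in the cone (28) is easy to check for a given coefficient array. The Python sketch below is an illustration under assumptions: coefficients are stored as a $p \times d$ array with one row per group, the penalty is applied to the $\ell_\gamma$ norm of each row, any group weights are ignored, and $\rho$ defaults to the identity (a group-Lasso-type penalty). The function name `in_cone` and the example vectors are hypothetical.

```python
import numpy as np

def in_cone(b, support, theta, gamma=2.0, rho=lambda t: t):
    """Check sum_{j not in S} rho(||b_j||_gamma) <= theta * sum_{j in S} rho(||b_j||_gamma).

    b has shape (p, d): one row of d basis coefficients per covariate (group).
    """
    group_norms = np.linalg.norm(b, ord=gamma, axis=1)
    pen = rho(group_norms)
    on = np.zeros(b.shape[0], dtype=bool)
    on[list(support)] = True                  # groups in the sparsity set S
    return bool(pen[~on].sum() <= theta * pen[on].sum())

p, d = 6, 3
support = {0, 1}

b_inside = np.zeros((p, d))
b_inside[0, 0], b_inside[1, 1], b_inside[2, 0] = 1.0, 2.0, 0.5   # small mass off the support

b_outside = b_inside.copy()
b_outside[3, 0] = 10.0                                           # large mass off the support

print(in_cone(b_inside, support, theta=3.0), in_cone(b_outside, support, theta=3.0))
```

Swapping in a different `rho` changes the shape of the set, which is the sense in which the cone adapts to the penalty.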

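Restricted Eigenvalue Assumption RE($\vartheta$, $s$, $\gamma$): There exists a positive number $\zeta = \zeta(s) > 0$
such that

(29) $\min_{\Delta \in \mathcal{C}_{\vartheta,\rho},\ \Delta \neq 0,\ \delta \in (0,1)} \dfrac{\|\{-\nabla^2\mathcal{L}_n(\beta^* + \delta\Delta)\}^{1/2}\Delta\|_2}{\|\Delta_{\mathcal{M}_*}\|_{1,\gamma}} \ge \zeta,$

where $\|\{-\nabla^2\mathcal{L}_n(\beta^* + \delta\Delta)\}^{1/2}\Delta\|_2^2 = \Delta^T\{-\nabla^2\mathcal{L}_n(\beta^* + \delta\Delta)\}\Delta$ and $\|\Delta_{\mathcal{M}_*}\|_{1,\gamma} = \sum_{j \in \mathcal{M}_*}\|\Delta_j\|_{\gamma_j}$.
Note that the usual scaling factor of $\sqrt{n}$ disappears in the definition of the restricted eigenvalue
condition because it is included in the definition of the empirical norm $\|f_{(\cdot)}\|_n$ in the numerator.
Moreover, compared to the RE condition in Bickel et al. (2009), the denominators differ in that the $\ell_2$ norm
is replaced with the $\ell_{1,\gamma}$ norm. For every $\delta_1 \in (0, 1)$, $\Delta^T\{-\nabla^2\mathcal{L}_n(\beta^* + \delta_1\Delta)\}\Delta = \|f_\Delta\|^2_{n,b}$ for some
particular $b = \beta^* + \delta_1\Delta$. From the decomposition of the risk function $\mathcal{R}_n$ we have

$\zeta \le \min_{\Delta \in \mathcal{C}_{\vartheta,\rho},\ \Delta \neq 0,\ \delta \in (0,1)} \dfrac{\|f_\Delta\|_{n,\beta^* + \delta\Delta}}{\|\Delta_{\mathcal{M}_*}\|_{1,\gamma}} \le \min_{\Delta \in \mathcal{C}_{\vartheta,\rho},\ \Delta \neq 0} \dfrac{\|f_\Delta\|_{n,b}}{\|\Delta_{\mathcal{M}_*}\|_{1,\gamma}}.$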

In comparison to least squares procedures, where $\nabla^2\mathcal{L}_n(\beta^* + \delta\Delta) = \nabla^2\mathcal{L}_n(\beta^*) = -X^TX$ and
restricted eigenvalue conditions are defined on the eigenvalues of $X^TX$, we need to impose a uniform
eigenvalue bound as in (29) over all $\delta \in (0, 1)$. In more detail, following Lemma 1 we have that

(29) implies 

(30) mm TT— -p — > mm " ^ 



AeC^,p,A5^o ||AAiJ|i,-y AeCp,p,A#o ||A^J|i,..y 

where $a$ was defined in Proposition 1. Therefore, assuming a point-wise lower bound on the
minimum eigenvalue of $-\nabla^2\mathcal{L}_n(\beta^*)$, i.e. assuming $\min_{\Delta \in \mathcal{C}_{\vartheta,\rho},\ \Delta \neq 0}\{\|f_\Delta\|_{n,\beta^*}/\|\Delta_{\mathcal{M}_*}\|_{1,\gamma}\} \ge \zeta$, will
not guarantee a positive lower bound in (30). Condition (29) can be seen as a rescaling of the minimum
eigenvalue problem in the classical RE condition, needed for complex likelihood structures.

Determining the class of matrices that satisfy the RE($\vartheta$, $s$, $\gamma$) condition is an important open question
that needs substantial novel work, as the random process $V_n(\beta^* + \delta\Delta, t)$ does not belong to a
Gaussian random ensemble for any $t$ and has a complicated correlation structure. In particular, with
respect to time it has a martingale-like structure and with respect to location, i.e. to $\beta^* + \delta\Delta$, it is a
function of the matrix J2i=i J2^=i O^i)'^ O^q) ■ Using Condition 1 and boundedness of functions 
matrix V„(0, t)dN{t) will belong to a random matrix ensemble with sub-gaussian tails, which 
were studied in Zhou (2009) and Raskutti et al. (2010) respectively. More specifically, we can easily 
extend their results, to conclude that when S = — and the sample size is sufficiently large, 
there exists a positive > such that 

||{-V^ £„(r +^A)}V^AH2 
mm — r > Qs- 

This is, however, not sufficient to indicate the sensitivity of RE to a change in location and how this
restricts the family of matrices that satisfy (29). Although this is an important question, it lies outside
the scope of the current paper. We are now ready to state the main result of this section.

Theorem 2 (SOI). Let $\hat\beta$ be defined as in (6) with the convex penalty function $P(b)$ defined in
(7) and let the assumption RE(7, s,^) hold with ( ~ C{s)- Then, with probability no less than 
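$1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\max_j\gamma_j^*}\rho'^2(0+)\}$, for some positive constant $c_1 > 4$, under Condition 1, there
exists a $c \in (0, 1)$ and $b_{\hat\beta} = c\hat\beta + (1 - c)\beta^*$ such that

with $b^* = \tilde{c}b + (1 - \tilde{c})\beta^*$, $\tilde{c} = \tilde{c}(b) \in (0, 1)$ and where $|\mathcal{M}_*| = \mathrm{card}\{\mathcal{M}_*\}$.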

We note that the previous result does not necessarily imply oracle-type properties of the prediction
error measured through the $\ell_2$ norm $\|f_{\hat\beta} - f_{\beta^*}\|_n^2$. If the dimensionality is small enough to guarantee
$\underline{\omega} > 0$ ($\underline{\omega}$ defined in Section 2.2), then a Gaussian-like SOI result would hold without additional work.
The argument does not hold if $p > n$, for which we develop the two-stage technique. We believe similar
reasoning will extend to all settings where the complexity of the model is so severe that the local
neighborhood $B(r_n)$ of Section 2.2 depends on the dimensionality.

Corollary 3 (Local SOI). Let the notation and conditions of Theorem 2 hold. Then, on the event
$\mathcal{E}_n \cap \{\underline{\omega} > 0\}$, for $0 < \epsilon = (N - \underline{\omega})/\underline{\omega}$,

With A„ = 72A2E,eA,. d^'^'^ UNC)- 

Note that $\epsilon$ is quite large for large $n$ and $p$, making (32) trivial (by arguments similar to those for
Corollary 2). Local quadraticity and local normality of Section 2 give us guiding principles to
inherit local Gaussian-like properties, and at the same time a way to extend them to the whole $p$
dimensional space. If we are able to bound $a_{\hat\beta,\beta^*}$ from Proposition 1, then we can see that by nesting
two neighborhoods, one that carries quadraticity and one that carries normality, local properties
might extend to the whole space. Bounding $a_{\hat\beta,\beta^*}$ will show that we almost never end up outside
of the intersection of those two neighborhoods.

To that end we prove the following Theorem 3. It shows that, with high probability, the estimator
$\hat\beta$ lies within a small elliptical neighborhood that carries quadraticity. For an appropriate choice of the tuning
parameter $\lambda_n$, the area of this elliptical neighborhood becomes dimensionality independent. In that
sense, it provides tight, dependent bounds in the first stage of the introduced two-stage argument

(see the end of Section 2).

Theorem 3. Let Condition 1 and the assumption RE(3, $s$, $\gamma$) hold with $\zeta = \zeta(s)$. Then with
probability no less than $1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\max_j\gamma_j^*}\rho'^2(0+)\}$, for some positive constant $c_1 > 4$,

for $\hat\beta$ defined in (6) we have

(33) < l4exp(26C^ ^ d^hA ^ ^V.' , 

Proof of Theorem 3. Following the same steps as in the proof of Theorem 2, but concentrating on
the special case of $b = \beta^*$ instead of considering all $b$, we have from (55) for $\Delta = \hat\beta - \beta^*$

ll/3-//3*|ll3*+A„Erf^/^^V(llA,||,,)<4A„ d'/"^V(llA,!|,J. 

This implies that E,eA^;(b) d'^^'^Vdl A,||^J < iT.,^M,{'^)d^'^' P{\\^,U) or that A e C3 as 
defined in (28). This combined with assumption RE(3,s,7) in (29) gives 



11/3 11^,3. <4A„ d'/''^*p(ll3,-/3*ll.,)<4A d'/"^*ll/3-//3-|l„,r/C, 



(34) ||/3-//3.|lf,3.<16^ E 



where in the last step we utilized a version of (56) for $b = \beta^*$. Solving the previous inequality we have
that

^ jeM, 

Note that from RE(3, $s$, $\gamma$) and the previous inequality,

16^" Vd^/-^; > wf f p _iv (3-r)"v»(b3)(3-r) 

^ jeM, WPm, ^ PM,\\i,~f 

where we used the notation $\|\hat\beta_{\mathcal{M}_*} - \beta^*_{\mathcal{M}_*}\|_{1,\gamma} = \sum_{j \in \mathcal{M}_*}\|\hat\beta_j - \beta_j^*\|_{\gamma_j}$. From the Cauchy-Schwarz
inequality we have that



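Knowing that $\hat\beta - \beta^* \in \mathcal{C}_3$ and using the convexity of $\rho$ we have $\|\rho(\hat\beta_{\mathcal{M}_*^c} - \beta^*_{\mathcal{M}_*^c})\|_1 \le 3\|\rho(\hat\beta_{\mathcal{M}_*} - \beta^*_{\mathcal{M}_*})\|_1$
and hence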

j=l j6A^* j^M, 

^^^'^ jeM, 
From Proposition 1 and the last inequality we have that ||/^ — jj^^ ^* < ||/^ — p -.e , 
with < 2maxi<i<„ |(/3 — /3*)"^\I'(Xi)|. Result of the Theorem follows from noticing 



j=i j=i \fc=i 

p 



j = l ^ ^ j<£M 



□ 



We are left to show that Gaussian-type SOI results extend to hold even with complex
log-likelihood structures. Inside a neighborhood of type (34) we can now further localize the problem by
using Gaussian processes (from Corollary 1) as sandwich bounds around our log-likelihood process.
In that way we show that the elliptical neighborhood can be further shrunk into $\ell_2$ balls of smaller radius.
This procedure constitutes the second step in the technique we developed, leading to the following main
result.

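Theorem 4 (Global SOI). Under the assumptions of Theorems 2 and 3, with probability no less than
$1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\max_j\gamma_j^*}\rho'^2(0+)\}$, for some positive constant $c_1 > 4$,

(35) $\|f_{\hat\beta} - f_{\beta^*}\|_n^2 \le (1 + \epsilon)(1 + \delta)\min_{b \in \mathbb{R}^{pd},\ |\mathcal{M}_*(b)| \le s}\{2\|f_b - f_{\beta^*}\|_n^2 + 3A_n\}$
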

for a nonnegative constant e = ^ — , S = exp{4Cr„} — 1, r„ = ^2^" yJ2j£M, '^^'^^ '^'^'^ — 
2fi^i^V d'^l^'i 

Proof of Theorem 4. From Theorem 2 we have

11/3 - //3*ll'3* < 2!l/b - //3HI'.b* + Sa^ 

for any vector $b$ such that $s(b) \le s$. Note that any such vector satisfies $b \in \mathcal{C}_{0,\rho}$. Moreover, we are only
interested in looking at the minimum over the local neighborhood; hence, to that end, notice that

(36) ll/b - //3*|l«,b* < llbM. -/3m.IIi.7 . ^ , II/aIL,;3 ^ ^^2 

where Cmax is the upper bound on the maximum eigenvalue problem stated above. Note that due 
to RE{3, 5,7) assumption, we can see it is of the order of . By following similar arguments as 
in Theorem 3(after equation (34)), we have 



utilizing Proposition 1 and Proposition 2 

ll/b - f,^ < e^^^^A/^^^^^ll/, - f,4l,, 
<Ne'^'^^^^^^^^\\f,- f,4l 
where C > maxfc_,;j )|. On the other hand, from Theorem 3, we have a^_^. < 26C^^^ X^jgA^. d^/''^ . 

Hence, using Proposition 1 and Corollary 1 we are able to conclude 



-26C 



- fp'Wn.p* 
fp - fp' \\n- 



Moreover, combining these last few inequalities we have 
+ 



72iV, 2 26C^E-.M d'/^^' 



^ = JEM, 

< (l + e)(l + 5){2||/b-/^HI'+3A„}, 



fore=_=, A„ = 26^ 



■ d'/^^' and <5 - 1 = cxp {2C^^E,,m. } • 



□ 



In comparison with similar results obtained for least squares problems, the previous result differs in
the presence of the exponential term on the right hand side of (42) and the direct influence of the censoring
rate $N/n$. It leads to inherently different choices of the tuning parameter $\lambda_n$ compared to simple linear
regression, where the choice is only governed by probability bounds. Here the choice of $\lambda_n$ leads
to a tradeoff between a big elliptical neighborhood and a small probability bound. Nevertheless, for
typical choices of the tuning parameter $\lambda_n$ (see Section 6), the exponential term $\delta$ becomes bounded by
a constant when the censoring is not severe, making this result comparable to the one obtained for
least squares setups (see for example Meier et al. (2009); Massart and Meynet (2011); Lounici et al.
(2011)).

To that end, note that w (as defined in Section 1) is much larger than w as it is defined at 
the true point $\beta^*$ and independent of the dimensionality of the parameter space. For that reason $\epsilon$ is
independent of the dimensionality $p$, allowing us to handle exponentially high dimensional problems.
Moreover, it is independent of $n$ and $s$ if the following holds



(37) 



P 



maxig-K„ Wi(/3*) 



i\7^2 Wj(/3*) 



> c < e" 



where $c > 0$ and $u_n(c)$ is a diverging sequence. It is easy to see that with sub-Gaussian covariates
$X_i$ and functions $\{\Psi_k\}_k$ this inequality holds true as long as we assume that the norm $\|f_{\beta^*}\|_\infty$
is bounded.

Equation (37) can be viewed as a non-asymptotic equivalent of Le Cam's "uniform asymptotic
negligibility condition", needed for Le Cam's second lemma. Le Cam's second lemma shows that the
estimator has an asymptotically normal distribution, and our general SOI result of Theorem 4 shows
that our estimator inherits the non-asymptotic predictive behavior of a penalized Gaussian estimator. In
this way the penalized estimator $\hat\beta$ behaves "similarly" to the penalized least squares estimator in linear
regression models. In more detail, we have the following result.
Corollary 4. Under the assumptions of Theorem 3, with probability no less than $1 - 6pd\exp\{-c_1 n\lambda_n^2 d^{2/\max_j\gamma_j^*}\rho'^2(0+)\}$,
for some positive constant $c_1 > 4$,

(38) 11/3 -/^Hl^.< 3(1 + e)(l + <5)16^ ^ d'/^^ , 
for e={N-^)/^ and 6 = exp js ^"^^^* '''' | - 1. Moreover, 

(39) ii3-riii,7 = Eii3,-/3*ii., <i3y^§ ^ d^/^^ 

The dependence on the censoring rate $N/n$ in all of the results in this section is explicit and is of interest
in its own right. The last statement (39) shows that model selection properties of the nonparametric Cox model,
up to an inverse factor of the censoring rate, match those of penalized simple linear regression problems.
In asymptotic studies of risk properties in Cox models, it never appeared explicitly. With the
increase in sample size $n$, the number of uncensored observations increases and $N/n$ asymptotically
converges to a constant. Hence, studying non-asymptotic properties of estimators with censored
data becomes significantly more important. We believe that almost no previous work has retrieved
the explicit dependence of risk properties of censored models on the censoring rate.



5. Sparse Non-Convex Group Selection with Convex Guide 

By using the guidance of convex group selection, in this section we propose a non-convex extension
of the proposed FGP family of penalties in which the function $\rho$ is no longer restricted to be
convex. Although good risk properties of the Lasso and group Lasso have been well documented,
their non-convex counterparts have not been much studied from the perspective of finite sample
oracle risk bounds. Bradic et al. (2011) studied variable selection properties of a class of non-convex
relaxations of the $L_0$ penalty, called folded concave penalties and first introduced in Lv and Fan
(2009), but did not address finite sample oracle risk bounds. Moreover, the family of non-convex
functions there needed to be restricted by an upper bound on the concavity of $\rho$. If an equivalent
condition is imposed in this setting, then all the previous results (Theorems 2-3) hold for such
functions $\rho$ (with the RE condition changed to the one of this section).

It is of interest to see what the finite sample risk properties of group non-convex relaxations
of the structured $L_0$ penalty are, and whether we can relax the concavity restrictions. In this section
we give results indicating that they enjoy good, and in most cases better, oracle risk bounds and that
the concavity bound can be lifted. Let us start with the definition of the class of concave penalties
through the following Non-Convex Assumption.

Non-Convex Assumption NCA: Let $\rho(t;\lambda_n)$ be increasing and concave in $t \in [0,\infty)$, even
with respect to $t$ and zero at $t = 0$. This assumption is a generalization of previous classes of concave
penalty functions as defined in Lv and Fan (2009) and Zhang and Zhang (2012). For better prediction
properties we are willing to sacrifice differentiability of the penalty function and continuity of the
regularized estimator. Apart from defining the class of penalty functions, we need to restrict our
analysis to a class of design matrices $X$ that satisfy a non-convex analog of the restricted eigenvalue
condition RE($\mu, s, \gamma$) of Section 4, as follows.

Non-convex Restricted Eigenvalue Assumption RE($\mu, s, \rho, \gamma$): There exists a positive
number $C_\rho = C_\rho(s) > 0$ such that



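(40)    $\min_{\Delta \in C_{\mu,\rho},\ \Delta \neq 0,\ \delta \in (0,1)} \dfrac{\sqrt{\Delta^T \nabla^2 \mathcal{L}_n(\beta^* + \delta\Delta)\,\Delta}}{\rho(\|\Delta_{M_*}\|_{1,\gamma})} \ge C_\rho,$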

where $\rho^2(\|\Delta_{M_*}\|_{1,\gamma}) = \sum_{j \in M_*} \rho^2(\|\Delta_j\|_{\gamma_j})$, and the restriction set $C_{\mu,\rho}$ is as defined in (28). Note
that the shape of the restriction set changes with the penalty function $\rho$. For convex penalty
functions, it becomes a subset of the corresponding sets for non-convex choices of $\rho$. Hence, the restricted
eigenvalue condition RE($\mu, s, \rho, \gamma$) for non-convex $\rho$ implies RE($\mu, s, \gamma$) and $C_\rho \le C$.
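
As a concrete illustration of the class described by Assumption NCA, the sketch below implements the SCAD penalty of Fan and Li (2001) and checks the NCA properties numerically on a grid. The function names `scad` and `satisfies_nca`, the default `a = 3.7`, and the grid-based check are illustrative assumptions and not part of the paper; "increasing" is checked in the weak (non-decreasing) sense, since SCAD is flat beyond $a\lambda$.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty evaluated at |t| (Fan and Li, 2001); requires a > 2."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t                                              # |t| <= lam
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))      # lam < |t| <= a*lam
    const = lam**2 * (a + 1) / 2                                  # |t| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def satisfies_nca(rho, grid, tol=1e-10):
    """Grid check (hypothetical helper) of the NCA properties:
    zero at 0, even, non-decreasing and concave on a uniform grid in [0, inf)."""
    vals = rho(grid)
    zero_at_origin = abs(float(rho(0.0))) <= tol
    even = np.allclose(rho(-grid), vals)
    nondecreasing = np.all(np.diff(vals) >= -tol)
    # On a uniform grid, concavity means non-positive second differences.
    concave = np.all(np.diff(vals, n=2) <= tol)
    return bool(zero_at_origin and even and nondecreasing and concave)

grid = np.linspace(0.0, 5.0, 501)
print(satisfies_nca(lambda t: scad(t, lam=1.0), grid))
```

A convex function such as $t \mapsto t^2$ fails the same check, which is the distinction the assumption is meant to draw.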

We propose the following two stage estimation scheme. Let the initial estimator be defined as
the $\ell_{1,\gamma}$ penalized empirical risk minimizer

$$\hat\beta_0 = \arg\min_{b} \Big\{ \mathcal{R}_n(b) + \lambda_{n,0} \sum_{j=1}^{p} d^{1/\gamma_j^*} \|b_j\|_{\gamma_j} \Big\}.$$

As this is a convex minimization criterion, Theorem 3 applies and we can conclude that there exists
a $c \in (0,1)$ and $\tilde\beta_0^* = c\hat\beta_0 + (1-c)\beta^* = \beta^* + c(\hat\beta_0 - \beta^*)$ such that



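For such defined $\tilde\beta_0^*$ let us define a random local elliptical neighborhood $B(\tilde\beta_0^*)$ centered at $\beta^*$ as

$$B(\tilde\beta_0^*) = \Big\{ b : \|f_b - f_{\beta^*}\|^2_{n,\tilde\beta_0^*} \le 16\,\frac{\lambda_{n,0}^2}{C^2} \sum_{j \in M_*} d^{2/\gamma_j^*} \Big\}.$$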

Then, we are able to define a non-convex estimator $\hat\beta_\rho$ as a solution to the following stochastic
optimization problem

(41)    $\hat\beta_\rho = \arg\min_{b \in B(\tilde\beta_0^*)} \Big\{ \mathcal{R}_n(b) + \lambda_n \sum_{j=1}^{p} \rho(\|b_j\|_{\gamma_j}) \Big\}.$

Note that this minimization problem is nontrivial, i.e. $P(\hat\beta_\rho \neq 0) \ge P(\mathcal{E}_n)$, for $\mathcal{E}_n$ defined in
Theorem 1, as on it there exists a feasible point in the constraint set $B(\tilde\beta_0^*)$. Moreover, zero is not
an absorbing state of this scheme: once group $j$ has been identified as zero in the first step, it has
positive probability of being nonzero in the second step.
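
To make the first-stage criterion concrete, the following sketch evaluates a group-penalized negative log partial likelihood of the kind minimized by the initial estimator. It is only an illustration under stated assumptions: `Psi` is a hypothetical precomputed matrix of basis-function evaluations (one row per subject), the risk set at an event time contains every subject with observed time at least that large, the constant $\log n$ term inside $\mathcal{R}_n$ is dropped (it does not change the minimizer), and $\gamma_j^*$ is taken to be the Hölder conjugate of $\gamma_j$. All function names are invented for the example.

```python
import numpy as np

def conjugate_exponent(gamma):
    """Hölder conjugate gamma* with 1/gamma + 1/gamma* = 1 (assumed meaning of gamma_j^*)."""
    if gamma == 1:
        return np.inf
    if np.isinf(gamma):
        return 1.0
    return gamma / (gamma - 1.0)

def neg_log_partial_likelihood(eta, time, event):
    """-(1/n) * sum over observed events of [eta_i - log sum_{k in risk set} exp(eta_k)]."""
    eta = np.asarray(eta, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    total = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]                 # subjects still under observation
        m = eta[at_risk].max()                    # log-sum-exp for numerical stability
        total += eta[i] - (m + np.log(np.exp(eta[at_risk] - m).sum()))
    return -total / len(eta)

def group_penalty(b, groups, gammas, lam):
    """lam * sum_j |G_j|^(1/gamma_j^*) * ||b_{G_j}||_{gamma_j}; groups may overlap."""
    b = np.asarray(b, dtype=float)
    total = 0.0
    for idx, gamma in zip(groups, gammas):
        weight = len(idx) ** (1.0 / conjugate_exponent(gamma))
        total += weight * np.linalg.norm(b[idx], ord=gamma)
    return lam * total

def penalized_cox_objective(b, Psi, time, event, groups, gammas, lam):
    """Empirical risk plus group penalty for a coefficient vector b."""
    eta = np.asarray(Psi, dtype=float) @ np.asarray(b, dtype=float)
    return neg_log_partial_likelihood(eta, time, event) + group_penalty(b, groups, gammas, lam)
```

With all $\gamma_j = 2$ the weight reduces to the square root of the group size, which is the usual group Lasso scaling; passing overlapping index lists in `groups` gives a penalty of the hierarchical type discussed in Section 6.1.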

In more details, we assume that there exists j such that (3^j =0 and /3* 7^ 0. Then fS^j = 
(1 - c)(3* ^ and (b^ - /3*)^V„(3Ij)(bj - (3*) ^ with VniWo.j) being a dx d submatrix of 

$V_n(\tilde\beta_0^*)$ corresponding to the $j$-th group. This in turn implies that $P(\hat\beta_{\rho,j} \neq 0) > 0$, in
contrast to many thresholded variants of the Lasso procedure (see for example Meinshausen and Yu
(2009); Zhou (2010)), which do not have this property. This property guards against admitting
false positives (which the Lasso traditionally exhibits) while retaining good prediction properties.

Theorem 5 (Non-convex global SOI). Let $\hat\beta_\rho$ be as defined in (41) with the non-convex penalty
function $\rho$ satisfying Condition NCA, and let the assumption RE($\mu, s, \rho, \gamma$) hold for such a choice of
$\rho$ with $C_\rho = C_\rho(s)$. Then, for some positive constant $c_1 > 4$, with probability $1 - 6pd\exp\{-c_1 n\lambda_n^2\}$,
under Condition 1,

(42) 11/3 -fp4l<{^ + m + 5) min ^ {2||/b - |ln + 3A„} , 

for a nonnegative constant e — ^ — , 5 ~ cxp{4Cr„} — 1, r„ = ^ 2C^" '^^'^ ~ '^^W^' 

Note that Theorem 5 does not require any global optimality properties of the non-convex
minimizer, and in that sense it does not require any concavity bound on the function $\rho$. In fact, for
most non-convex choices of $\rho$ the FGP function (7) is neither convex nor concave. In that sense the
results of Theorem 5 extend to hold even for the computationally intractable and discontinuous
two-step $\ell_{0,\gamma}$ group penalty

$$\ell_{0,\gamma}(b) = \sum_{j=1}^{p} 1\{\|b_j\|_{\gamma_j} \neq 0\}.$$

As a corollary of previous theorem, we can conclude that Zg.-y regularized estimator /3;^ with convex 
guide (defined through (41) with p = Iq), with probability no less than 1 — 6p(iexp{— 
satisfies the following oracle inequality 

\\f',,^-f,'\\l< 

-exp{l2V2C^^} min _ L\f^-fp,\\l + 72^]. 

Here $\beta_n$ is defined as the minimum signal strength $\beta_n = \min\{\|\beta_j^*\|_{\gamma_j},\ j \in M_*\}$.

The choice of A„ is governed by the rate at which p'{0+) converges to zero, if we assume 
differentiability of p. Hence, in those cases Xn,o ^ A„. Moreover, if the number of dictionary 
functions d is smaller than ^/n, class of non-convex penalties p that satisfy RE(s,7, p, 7) with 
^-i/max7j^2 < (^'^ < (^'^ has Smaller upper risk prediction bound. Implication of this result is the 
following proposition. 

Proposition 3. Assume that the size of the dictionary functions is smaller than y/n. Let 
(7 = supfgjQ ^] ||-BV„(/3*, t)||„iax o,nd let A > A be a constant. Then, two stage group SCAD penalty, 

with the choice of Xn > A(7^J^^^^, will have smaller finite sample oracle risk bound than that of 

a group Lasso penalty with the choice of A„ > Acr\J'^^^- . 

On the other hand, the convolution of a non-convex $\rho$ and the $\ell_\infty$ norm, for example the SCAD and
$\ell_\infty$ norm of Wang et al. (2009), has worse oracle risk properties, in the sense of a bigger error bound,
than the $\ell_1/\ell_\infty$ penalty of Negahban and Wainwright (2011) in nonparametric Cox models. Further,
finite sample risk properties of non-convex penalties can be heuristically discussed as follows.

Since no single non-convex penalty beats all others, we design a sequence of non-convex group
regularized estimators with convex guide. The choice of $\rho$ is taken as the sequence of the following
newly proposed non-convex penalties



PrS) = m\ > r,} + Jl - . i{\t\ < rfc}. 



Curvature of an ellipsoid in small neighborhoods achieves selection properties. The sequence of
non-convex estimators $\hat\beta_{\rho_k}$ with a sequence of $r_k$'s chosen as $\lambda_n, \ldots, \lambda_n/2^k, \ldots$ will lead to
an estimator with prediction properties approximating the $\ell_0$ penalty. For $r_k = \lambda_n$ it is equivalent to
the group lasso estimator and for $r_k = \lambda_n/2^k$ it is a constrained problem as in (41) with constraint
set $B(\hat\beta_{\rho_{k-1}})$. It can be viewed as an approximation of the computationally unachievable but highly
desirable $\ell_0$ regularized group estimator.

Let us make a brief remark on the implementation of the proposed estimation scheme.
The neighborhood $B(\tilde\beta_0^*)$, centered at the true unknown $\beta^*$, is not achievable empirically. Hence, we
propose to use the same type of elliptical neighborhood, now centered at $\hat\beta_0$ with the same radius,
i.e. at

[ " ^ j&M, J 

This will inherently result in a small change in the optimization problem (41), i.e. in doubling the size
of $B(\tilde\beta_0^*)$, but will not affect the result of Theorem 5.

The tuning parameter $\lambda_{n,0}$ is chosen by the oracle inequality and varies with the choices of the $\gamma_j$'s (see
Section 6). For a typical choice of group Lasso, $\lambda_{n,0}$ is proportional to $\sqrt{\log(pd)/(nd)}$.
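
The scaling of the first-stage tuning parameter can be illustrated with a few lines of code. The sketch assumes the rate $\sqrt{\log(pd)/(nd)}$ read from the text; the helper name `group_lasso_lambda` and the default values of the constants `A` and `sigma` are placeholders chosen for the example, since in the paper they are determined by the oracle inequality and the design.

```python
import math

def group_lasso_lambda(n, p, d, A=5.0, sigma=1.0):
    """Illustrative tuning parameter A * sigma * sqrt(log(p * d) / (n * d)).

    n: sample size, p: number of groups, d: basis functions per group.
    A and sigma are placeholder constants, not values taken from the paper.
    """
    return A * sigma * math.sqrt(math.log(p * d) / (n * d))

# The rate grows only logarithmically in p, so p may be far larger than n.
for p in (10**2, 10**4, 10**6):
    print(p, group_lasso_lambda(n=200, p=p, d=5))
```

Because $p$ enters only through $\log(pd)$, multiplying the number of groups by a factor of $10^4$ changes the value by a small constant factor, which is the sense in which exponentially high dimensional problems remain tractable.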



6. Examples 

In this section we present particular cases of the group penalty functions (7) and show how they relate
to previous work. We consider heterogeneous choices of $\gamma_j$, for which different choices of $\lambda_n$ will be
appropriate. By allowing hierarchical structure within each group $j$ and among groups, we pay
the penalty of having to choose a larger tuning parameter than in independent group settings.



6.1. Hierarchical Selection and CAP. Previous work applies to a class of more complex additive
models where the groups in the additive model may share some but not necessarily all features across
groups. That way each function $f_j$ can be approximated with a different choice of functions $\Psi$. In
more detail, based on prior information, each $f_j$ can be approximated by $b_{\Gamma_j}^T \Psi_{\Gamma_j}$, where $\Gamma_j$ is a
set of covariates that belong to group $j$. The regularized estimator $\hat\beta$ is then defined as the minimizer
of



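$$-\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \sum_{j=1}^{p} b_{\Gamma_j}^T \Psi_{\Gamma_j}\,dN_i(t) + \int_0^\tau \log\Big\{\frac{1}{n}\sum_{i=1}^{n} Y_i(t)\exp\Big\{\sum_{j=1}^{p} b_{\Gamma_j}^T \Psi_{\Gamma_j}\Big\}\Big\}\,d\bar N(t) + \sum_{j=1}^{p} \lambda_{n,j}\,|\Gamma_j|^{1/\gamma_j^*}\,\|b_{\Gamma_j}\|_{\gamma_j},$$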

where $|\Gamma_j|$ stands for the cardinality of that set. Note that this setup incorporates the classical group
Lasso setting as well, where one would select all $\gamma_j = 2$. Then, for some constant $A > 4$ and the
choice of

A„j > Aa\/i°iM|r^.|-2/7l^ sup ||£;V„(r,t)|lmax,
once the RE(3, s, $\gamma$) condition is satisfied, with probability at least 1 - 6{pd}



N 



72 




(43) min <^ 2||/b - |l; ■ ,,.2 

The previous oracle inequality is one of the few that discusses high dimensional finite sample properties
of the CAP family proposed for linear models in Zhao et al. (2009). The proof of this result is a simple
modification of the results presented in the paper, with $\lambda_n$ being adaptive to each group $\Gamma_j$, and
is therefore omitted. The block $\ell_1/\ell_\infty$ penalty as proposed in Negahban and Wainwright (2011) is a
member of the CAP family. They show empirically that if the overlap among groups is not large enough
it behaves worse than the plain lasso estimator. Since our groups share a big part of this structure due
to the choice of functions $\Psi$, it is not surprising that this penalty has the best properties among all
members of the CAP family.

In comparison to previous results obtained for group Lasso, $\lambda_n$ is chosen larger due to
the overlapping structure among groups. In the nonparametric setting, the non-overlapping case of groups
(like the multi-task learning case in Lounici et al. (2011), for example) cannot hold and, as expected, we
pay the price in terms of worse prediction properties. Moreover, the implicit assumption on the
censoring rate N/ n and the choice of the number of basis functions d, takes the form of s^\og{pd) < 
N. The more events we observe, i.e. the smaller the censoring rate is, the bigger number of basis 
functions we can choose and the larger p and s we can handle. In the classical case of d ~ n~^/^ 
and no censoring N ^ n, the previous constraint becomes log(p) < ilogn+ p-, which implies that 
dense problems with s > v}l^ can not be efficiently retrieved. 



6.2. Smoothed Selection. Throughout previous sections we simplified the technical details and
left out the smoothing component of the penalty. Here, we show that the work
extends to this situation with a few adaptations. To that end, let us define the penalized
smoothed estimator as



p 



3s = argmm \ 7^„(b) -f A„ ^ Vdp ( ||bjRj||^^. + ■y/bjM,b, ) \ , for 7, > 2 

for a convex and subadditive choice of the function $\rho$, where the smoothing matrix $M_j \in \mathbb{R}^{d \times d}$
(Meier et al., 2009) contains the inner products of the second derivatives of the B-spline basis
functions, that is, $\{M_j\}_{kl} = \int \Psi''_{j,k}(x_j)\,\Psi''_{j,l}(x_j)\,dx_j$, $k, l = 1, \ldots, d$, and $R_j \in \mathbb{R}^{d \times d}$ is a matrix
obtained from the Cholesky decomposition of $M_j$, i.e. $M_j = R_j^T R_j$. Note that the smoothness of the
functions $\Psi$ requires the $\gamma_j$'s to be chosen larger than the order of desired smoothness of the functions, as
too few basis functions cannot guarantee smoothness. Then we can rewrite the problem
as



(44) 3s = argmin <j 7e„(b) + A„ ^ Vdp (||b 



b 



Jll7j

with $\tilde b_j = R_j b_j$ and $\tilde{\mathcal{R}}_n(\tilde b) = -\frac{1}{n}\sum_{i=1}^{n}\int_0^\tau \sum_{j=1}^{p} \tilde b_j^T R_j^{-1}\Psi(X_{ij})\,dN_i(t) + \int_0^\tau \log \tilde S_n^{(0)}(\tilde b, t)\,d\bar N(t)$,

$\tilde S_n^{(0)}(\tilde b, t) = \frac{1}{n}\sum_{i=1}^{n} Y_i(t)\exp\{\sum_{j=1}^{p}\tilde b_j^T R_j^{-1}\Psi(X_{ij})\}$. The crucial part of extending previous results to
this novel setting is extending the results of Lemma 1 to the new penalty structure, as stated in the
following lemma.

Lemma 3. On the event £„ = {|ih„j(;a*)||^. < 2A„A/dp'(0+), Vj e {!,••• ,p}}, with h„,j(,3*) = 
Jq {En,j{f3* ,t) — Rj^"^ {Xij))dNi{t) , result of Lemma 1 holds for penalty used in (44). 
Moreover, results of Propositions 1 and 2 hold for this new empirical risk function with and lo 
respectively substituted with x minjgjx ... pjlAminCR-j)} and lo^ defined as 



w„ := mm n „ i c • 

=s .6{i,..,„},.euj^,K, I ^,g^^exp{E^^,bjRTi*(X,,)} J 

With the help of this result, following previous proofs and closely monitoring triangle inequalities
and duplication of constants due to bigger penalization, we can conclude that for some constant
$A > 4$ and the choice of

$$\lambda_n \ge A\sigma\sqrt{\frac{\log(pd)}{nd}}, \qquad \sigma = \sup_{t \in [0,\tau]} \|E V_n(\beta^*, t)\|_{\max},$$

once the RE(3, s, 2) condition is satisfied, with probability at least $1 - 6\{pd\}^{1-A}$



(45) 11/3 /^.||2<— exp 48C-^ min }2\\f^-fp4l + —s\l 

The previous result is a unique finite sample result on prediction properties of the non-parametric
smoothed estimator in the high dimensional Cox model. Although tackled as the last problem, its
importance lies in the fact that such results cannot be obtained with techniques that already exist
in the literature.

Appendix A. Appendix 

A.1. Proofs of Lemmas.

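Proof of Lemma 1. This can be seen from the following reasoning. Let us define a function

$$f(b) := g_{n,0}(b) + \lambda_n P(b) = -(b - \beta^*)^T h_n(\beta^*) + \lambda_n P(b) - \mathcal{L}_n(\beta^*).$$

First, we will show that zero is a local minimum of the function $f(b)$ for all $b \in S(0,1)$ such that
$\|b_j\|_1 \le 1$. Note that $f(b) - f(0) = \sum_{j=1}^{p} -b_j^T h_{n,j}(\beta^*) + \lambda_n \sum_{j=1}^{p} d^{1/\gamma_j^*} \rho(\|b_j\|_{\gamma_j})$ and, conditional on
the event $\mathcal{E}_{n,j} = \{\|h_{n,j}(\beta^*)\|_{\gamma_j^*} \le \lambda_n d^{1/\gamma_j^*} \rho'(0+)\}$,

$$f_j(b_j) - f_j(0_j) = -b_j^T h_{n,j}(\beta^*) + \lambda_n d^{1/\gamma_j^*} \rho(\|b_j\|_{\gamma_j}) \ge \|b_j\|_{\gamma_j}\Big(-\|h_{n,j}(\beta^*)\|_{\gamma_j^*} + \lambda_n d^{1/\gamma_j^*} \rho'(0+)\Big) \ge 0,$$

where we have utilized Hölder's inequality. Hence we can conclude that $f(b) - f(0) \ge 0$ on the event
$\mathcal{E}_n = \cap_{j=1}^{p} \mathcal{E}_{n,j}$. Since $f$ is a convex function we can conclude that $0$ is a global minimum as well.
Note that we do not require uniqueness of the minimum.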

□
<



max 

l<j<p,l<fc<d 



max 

i<j<p,i<fe<d 



max 

l<j<pj<k<d 



{sL'^}Af3\t)~{s(^nMf3\t) 



Proof of Lemma 2. From Condition 1 and the following decomposition: ||E„(/3*,t) — e(/3*,i)|| 

(47) 
(48) 
(49) 



s(o)(r,i)l 



< 



abi 



oi 

b ^ bib-a)' 



To see that the last inequality is correct we follow two arguments stated as (i) and (ii):

(i) From Condition 1 we notice that $\inf_{0 \le t \le \tau} |S_n^{(0)}(\beta^*, t)| \ge b - a > 0$ for $b > a$. If $S_n^{(0)}(\beta^*, t) \ge s^{(0)}(\beta^*, t)$,
then utilizing Condition 1 and $s^{(0)}(\beta^*, t) \ge b > 0$ we conclude $S_n^{(0)}(\beta^*, t) \ge b > b - a$, since
$a > 0$. If $S_n^{(0)}(\beta^*, t) < s^{(0)}(\beta^*, t)$, then notice that $|S_n^{(0)}(\beta^*, t) - s^{(0)}(\beta^*, t)| = s^{(0)}(\beta^*, t) - S_n^{(0)}(\beta^*, t)$.
From Condition 1 we have that in this case $s^{(0)}(\beta^*, t) - S_n^{(0)}(\beta^*, t) \le a$, which implies
$S_n^{(0)}(\beta^*, t) \ge s^{(0)}(\beta^*, t) - a \ge b - a$.

(ii) If Condition 1 holds, then there exists a constant $a_1 > 0$ such that for every $1 \le j \le p$
and $1 \le k \le d$, $|\{S_n^{(1)}\}_{jk}(\beta^*, t) - \{s^{(1)}\}_{jk}(\beta^*, t)| \le a_1$ almost surely. This easily follows from the
following inequality, where $|\Psi_k(X_{ij})| \le c_1$ (by boundedness of the functions $\Psi$) and $|s^{(0)}(\beta^*, t)| \le c_2$
(by the assumption $|S_n^{(0)}(\beta^*, t) - s^{(0)}(\beta^*, t)| \le a < \infty$) for some positive constants $c_1$ and $c_2$:

$|\{S_n^{(1)}\}_{jk}(\beta^*, t) - \{s^{(1)}\}_{jk}(\beta^*, t)| \le |\{S_n^{(1)}\}_{jk}(\beta^*, t)| + |\{s^{(1)}\}_{jk}(\beta^*, t)|$

(50)    $\le c_1 |S_n^{(0)}(\beta^*, t)| + b_1$

(51)    $\le c_1 |S_n^{(0)}(\beta^*, t) - s^{(0)}(\beta^*, t)| + |s^{(0)}(\beta^*, t)| + b_1$

(52)    $\le c_1 a + c_2 + b_1.$

Moreover, notice that $a_1$ can be taken such that $a_1 > 1$ by construction in Condition 1, hence
(49) is smaller than or equal to $(a_1 + a b_1)/(b - a) \le 2a_1/(b - a)$. □
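
As a reading aid for the decomposition behind (47)-(49), the generic ratio bound can be written out as follows. This is a sketch under assumed notation rather than the authors' exact display: the arguments $(\beta^*, t)$ and the indices $jk$ are suppressed, and it is assumed that $|S_n^{(1)} - s^{(1)}| \le a_1$, $|S_n^{(0)} - s^{(0)}| \le a$, $|s^{(1)}| \le b_1$, $|s^{(0)}| \ge b$ and $|S_n^{(0)}| \ge b - a > 0$, as in arguments (i) and (ii). The resulting constants need not coincide with those of the original (49).

```latex
\begin{align*}
\left| \frac{S_n^{(1)}}{S_n^{(0)}} - \frac{s^{(1)}}{s^{(0)}} \right|
&= \left| \frac{S_n^{(1)} - s^{(1)}}{S_n^{(0)}}
      + \frac{s^{(1)}\,\bigl(s^{(0)} - S_n^{(0)}\bigr)}{S_n^{(0)}\, s^{(0)}} \right| \\
&\le \frac{\bigl|S_n^{(1)} - s^{(1)}\bigr|}{\bigl|S_n^{(0)}\bigr|}
      + \frac{\bigl|s^{(1)}\bigr|\,\bigl|S_n^{(0)} - s^{(0)}\bigr|}{\bigl|S_n^{(0)}\bigr|\,\bigl|s^{(0)}\bigr|} \\
&\le \frac{a_1}{b - a} + \frac{a\, b_1}{b\,(b - a)} .
\end{align*}
```

The first line adds and subtracts $s^{(1)}/S_n^{(0)}$; the last line uses the lower bound on the denominator from argument (i) and the numerator bound from argument (ii).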

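Proof of Lemma 3. We need to adapt Lemma 1 with

$$f_j(b_j) - f_j(0_j) = -b_j^T h_{n,j}(\beta^*) + \lambda_n \sqrt{d}\,\rho\big(\|b_j\|_{\gamma_j} + \|b_j\|_2\big) \ge \|b_j\|_{\gamma_j}\Big(-\|h_{n,j}(\beta^*)\|_{\gamma_j^*} + \lambda_n \sqrt{d}\,\rho'(0+) + \lambda_n \sqrt{d}\,\rho'(0+)\frac{\|b_j\|_2}{\|b_j\|_{\gamma_j}}\Big).$$

For $\gamma_j \ge 2$ we know that $\|b_j\|_{\gamma_j} \le \|b_j\|_2$, leading to the conclusion that

$$f_j(b_j) - f_j(0_j) \ge \|b_j\|_{\gamma_j}\Big(-\|h_{n,j}(\beta^*)\|_{\gamma_j^*} + 2\lambda_n \sqrt{d}\,\rho'(0+)\Big),$$

which leads us to conclude that the result of Lemma 1 holds for this particular penalty as well on
a set $\mathcal{E}_n = \{\|h_{n,j}(\beta^*)\|_{\gamma_j^*} \le 2\lambda_n \sqrt{d}\,\rho'(0+),\ \forall j \in \{1, \ldots, p\}\}$ whose size is easily deducible from
Theorem 1.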

The proof of the equivalent of Proposition 1 extends easily, bearing in mind that the equivalent of $V_n(b)$ has
extra terms, which enter, for example, the $a_i$ terms as $(b - \beta^*)(R^{-1}\Psi(X_i) - E_n(\beta^*, t))$.
Here $R$ is the block diagonal matrix

$$R = \begin{pmatrix} R_1 & 0 & \cdots & 0 \\ 0 & R_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_p \end{pmatrix}.$$

With this altered $V_n(b)$ the proof follows exactly the same steps as the proof of Proposition 1 (since each
$R_j$ is an invertible $d \times d$ matrix with bounded eigenvalues) and is therefore omitted.

The definition of the weights $w_i(b)$ in the proof of Proposition 2 will be changed to address the new
weighting matrix $R_j$. Once they are redefined, the proof of the equivalent of Proposition 2 follows
similar steps, hence we omit the details here. □

A.2. Proofs of Propositions.

Proof of Proposition 3. For all A G Ct^scad and p = SCAD, 



^M, 111. 2 



< 



jeM, 

Analysis of the upper bound will be split into two parts. First, note that 

E p^(||A,||2)< E max|p'(0+),H^^MV i|A,||^ < S^(A) ||A 

c A>f ^^ ^ II J II- J 



M.lll,2> 



jeMt jeM 



where ^p(A) = max |p'(0+), sup^g^^ {^^fsTJIT^}} by Proposition 1 of Zhang and Zhang (2012). 
For all A G Cy^scAD the supremum in the definition of S is reached among those j for which 
ll^ilh < A„, and for those j we already know that p^(||Aj||2) = A^J|Aj||2 < || AjUl. On the other 
hand for all A e Cv^scad such that infj-gwi, ll'^jlb > Ki, it is easy to see that if Xnn^^'^ > 2p'{0+) 
we have 

sup Sscad(A) < n^/". 

^6C7_sCAD.infieAi. ||Aj||2>A„ 

From Theorem 5 we see that the optimal choice of A„ for obtaining SOI of group SCAD penalty 
is A„ > Aa,J^^^, {where A and a are defined in Section 6) for which the previous requirement 
becomes trivially satisfied as p'(0+) < ^/n and '°g^?V^) > ^p>j2(^Q_^_y □ 

A.3. Proofs of Theorems.

Proof of Theorem 2. Let VLn = |||h„j(/3*)||^. < A„d^/'^^'p'(0+)/2, G {1, . . . Following sim- 

ilar steps as with bounding in Theorem 1 we know that P{Vln) > 1— Gpdexp ^—cind^~^^^X'^p'^ {0+ 
On r2„ we have (from (18) and (19)) that 

p 

(53) Wfp-fp'Wlrp' ^ ll/b-//3*lllb*+A„E^'^^^* (p(ll3,-b,|l,,)+2p(|lb|l^J-2p(|l3^.|l^j) , 

for all $\tilde\beta^* = c\hat\beta + (1 - c)\beta^*$ and $\tilde b^* = cb + (1 - c)\beta^*$ as defined in Theorem 1. From here on they
will be fixed. The proof follows the lines of work on SOI problems (see for example Bickel et al. (2009);
Lounici et al. (2011)). For all $b$ that satisfy $s(b) \le s$, with $s(b)$ denoting the sparsity of vector $b$,
we have that

(54) ll/3-//3*lll3*+A„EdV7;p(||3^._b,||,^.) 

< ll/b-//3.|l',b.+2A„f]di/^.* (p(||3^.-b,||^J+p(||b||^J-p(||3,lk,) 
< ll/b-//3- 11,1b- +2A„ d'^'' (p(ll3,-b,lk.)+P(l|b|k,)-p(!l3,|l7,) 

j<£M, 

From triangular inequality for the FGP function we have pdlbjlj^^) < piWf^j — ^jWij) + PiWf^jWij) 
leading to 

(55) +A„f]d^/^^*P(I!A,|U,) < l|/b-/^^*lln,b. +4A„ J2 d'^-'hiW^.lU, 

j=i jeM, 

where A = /3 — b. 

We consider two cases : 

(i) 4A„E,eA.. d^'-'hiW^jU) > \\h-fP'\\U'^ 

(ii) 4A„E,e>,. d^^^hm^K) < 
Case (i) From (55) we have 

II/? 11^,3- +A„f]d^/''^'p(l|A,||,,) < 8A„ d^Z-'hiW^jK)- 

This implies that J^jeM- d^^'^hiW^jh,) < '^Y.jeM. d^^'^hiW^jh,) t^^t A e Cy.p as defined 
in (28). For such A, \\p{AM,)h = JELi P^iW^jW-y,) < W^Mjlm and (29) we have that 



Y d'/^^*p(l|A,||,,) < V~d Y P'iW^^lU < ^d\\^M,hn 
jeM, y jeM, 

(56) < ^'min{|i/^-/b||„3.,||/^-/b||„.b*}, 

for d = J2jeM, d^^'^^ ■ Hence we obtain 

11/3 - //3-lllb* + A„^di/7;p(||A,|l,^.) < ^A„min{||/3 - /b||„,3-, II/3 - /b||„,b-} , 

which together with triangular inequality and min{a + 6, c + d} < min{a, c} + d for any a, &, c, d 
nonnegative entitles us to conclude that 

\\h-^p^trp' < ^'A„(a. + ii^.-/bii„bO, 

for w = min{||/^ - //3' IL.3* JI/3 - //3*l|n.b*}- On the other hand, from (55) (for b = (3*) we 
have a;2 < ||/- - ||^ 3* < 4A„ EjeM, d^^^hi0j - /S^IU.)- Combined with (56), this gives that
which gives us lj and

Wfp-fp'Wlr^-' < pdA^ + ^'A„ii//3*-/biu,b. 

(57) < pdA^ + ^'A„||/;3*-/b|U,b. + ||//3. -/bll'„.b.. 

Remember that for case (i) wc assumed \\fh — fp' Wnh* — J2jeM (b) d'^^'^' PiW^jW-fj) which with 

(56) gives us 

!l/b-//3.|l,V < 4^miu{||/3-/b||„3.,||/3-/b!|„,b.} 

< 4^(.. + l|/;3.-/b||„.b0 

c 

Solving this quadratic inequality gives us that $\|f_b - f_{\beta^*}\|_{n,\tilde b^*} \le (2 + \sqrt{5})\sqrt{s}\,\lambda_n/C$. Together with

(57) this leads us to: 



< 16J^ + 4—^11/^. -/b||„,b.. 



which gives us that for every arbitrary b we have 

7; 

< - 



Wfp-fp-Wlr^' < ^ d2/^.*+2ii/b-/^HIlb. 



Case (ii) From (55) we have 

<^Xl d^/-^'+2||/b-/^.|l,lb.. 



□ 



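Proof of Theorem 5. The probability of the event $\Omega_n$ can be easily derived from Theorem 1, hence we omit
the details. From the definition of the non-convex minimizer $\hat\beta_\rho$ we know that for all $b \in B(\tilde\beta_0^*)$,

$$\mathcal{R}_n(\hat\beta_\rho) + \lambda_n \sum_{j=1}^{p} \rho(\|\{\hat\beta_\rho\}_j\|_{\gamma_j}) \le \mathcal{R}_n(b) + \lambda_n \sum_{j=1}^{p} \rho(\|b_j\|_{\gamma_j}),$$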

leading to Wf',^ ~ f.'WlcP.,^, < 

ll/b - f(3'\\U' + 2 (3p - b) Kin + 2A„^ (p(||b,||^J - pi\\0,},\U) . 

Now we need to check whether an equivalent of Lemma 1 holds for non-convex choices of $\rho$. The global optimality
condition of this lemma certainly will not hold, but the local optimality still holds. As we are interested
in $b \in B(\tilde\beta_0^*)$, local optimality suffices for our needs. Let us now define the event $\Omega_n$ as
$\Omega_n = \{\|h_{n,j}(\beta^*)\|_{\gamma_j^*} \le \lambda_n/2,\ \forall j \in \{1, \ldots, p\}\}$. On this event, for all $b \in B(\tilde\beta_0^*)$ we have

$$-b_j^T h_{n,j}(\beta^*) + \lambda_n \rho(\|b_j\|_{\gamma_j}) \ge \|b_j\|_{\gamma_j}\big(-\|h_{n,j}(\beta^*)\|_{\gamma_j^*} + \lambda_n\big) \ge 0,$$

as all non-convex penalties in this local neighborhood are lower bounded by the $\ell_1$ penalty. Then, we
can conclude that 11 fs — fa*\\'^ ,7, ^ is no bigger than 

II /b - ff3' ll^.b* + A„ II {34, - b, + 2A„ ^ (p(||b,||^,) - P(||{3p},ll7,) 

J=l J=l 

In local neighborhood B{f3g), the concave function p is upper bound of absolute value function. 
Using its concavity and non- increasing property (which imply subadditivity) we have ||/^ — 

+ ^nEUPiWi^ph b,l|,J < ll/b - /;3.|l,lb. + 4A„E;=iP(ll{3,b - b,l|,,), for 
some b* = cb + (1 — c)/3* and c = c(b) e (0,1). Following the same line of reasoning as in 
the proof of Theorem 2 (following lines after equation (55)), but for concave constraint set Cj^p 
and concave restricted eigenvalue condition RE(s,7,p, 7) we have that there exists c G (0,1) and 
{f3p}j3* = cf3p + (1 — c)(3* such that 

r 72 
(58) II//3 -//3-ll,%3 , < min _ 2||/b - /^g. |llb* + T^^A^ 

Gaussian type oracle bounds now easily follow from extension of Theorem 4 to non-convex 
penalties. This extension can be done with almost no change in the previous techniques on convex 
parts except that we should be mindful that the penalty norm takes different form. □ 

A.4. Proofs of Corollaries.

Proof of Corollary 1. The upper bound follows directly from Proposition 2, as it is a uniform bound
over the whole parameter space. The lower bound follows by repeating the same steps as in
Proposition 2 and using the definition of the weight vectors $w_i(\beta^*)$ in (15). □

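Proof of Corollary 2. From Theorem 1 we have that

$$\|f_{\hat\beta} - f_{\beta^*}\|^2_{n,\tilde\beta^*} \le \|f_b - f_{\beta^*}\|^2_{n,\tilde b^*} + 4\lambda_n P(b),$$

for any vector $b$.

From Proposition 2 we have that for $\omega > c > 0$,

$$\|f_{\hat\beta} - f_{\beta^*}\|^2_{n,\tilde\beta^*} \ge \omega\,\|f_{\hat\beta} - f_{\beta^*}\|^2_n \quad\text{and}\quad \|f_b - f_{\beta^*}\|^2_{n,\tilde b^*} \le N\,\|f_b - f_{\beta^*}\|^2_n.$$

From the last three inequalities we conclude that $\omega\|f_{\hat\beta} - f_{\beta^*}\|^2_n \le N\|f_b - f_{\beta^*}\|^2_n + 4\lambda_n P(b)$, that is

$$\|f_{\hat\beta} - f_{\beta^*}\|^2_n \le \frac{N}{\omega}\,\|f_b - f_{\beta^*}\|^2_n + \frac{4}{\omega}\,\lambda_n P(b).$$

If we define $\varepsilon$ to be such that $1 + \varepsilon = N/\omega$, then the result in (27) holds. □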



Proof of Corollary 3. The proof follows easily from the result of Theorem 2 and similar arguments
as in the proof of Corollary 2. □

Proof of Corollary 4. The proof follows from the results of Theorem 3 by taking similar steps as in
Theorem 4 and is therefore omitted. The second statement is a simple consequence of Theorem 3. □